Mean Insert Size Detection

When working with paired-end data, DRAGEN must decide which of the highest-quality alignments for the two ends are most likely to form a pair. To make this choice, DRAGEN uses a Gaussian statistical model to evaluate the likelihood that two alignments constitute a pair. This model is based on the intuition that a particular library prep tends to create fragments of roughly similar size, producing pairs whose insert lengths cluster around some mean insert length.
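To illustrate the idea, the following is a minimal sketch of scoring candidate insert lengths under a Gaussian model. It is not DRAGEN's implementation: the function names are hypothetical, and a real aligner would also weigh alignment scores and read orientation, which this sketch ignores.

```python
import math

def insert_log_likelihood(insert_len, mean, stddev):
    """Log density of insert_len under a Gaussian N(mean, stddev^2)."""
    z = (insert_len - mean) / stddev
    return -0.5 * z * z - math.log(stddev * math.sqrt(2.0 * math.pi))

def rank_candidate_inserts(insert_lengths, mean, stddev):
    """Sort candidate insert lengths from most to least likely.

    Hypothetical helper: each candidate stands for one possible pairing
    of an alignment of end 1 with an alignment of end 2.
    """
    return sorted(
        insert_lengths,
        key=lambda length: insert_log_likelihood(length, mean, stddev),
        reverse=True,
    )

if __name__ == "__main__":
    # Mean and standard deviation chosen for illustration only.
    print(rank_candidate_inserts([12000, 520, 405], mean=410.0, stddev=15.0))
```

Candidates whose insert length lies close to the mean rank first, while a pairing that implies an insert thousands of bases long receives a very low likelihood.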

If you know the statistics of your library prep for an input file (and the file consists of a single read group), you can specify the characteristics of the insert-length distribution: mean, standard deviation, and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert, Aligner.pe-stat-stddev-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristics automatically.

To enable automatic sampling of the insert-length distribution, set --enable-sampling to true. When the software starts execution, it runs a sample of up to 100,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.

The DRAGEN host software reports the detected statistics in its stdout log, as follows:

Final paired-end statistics detected for read group 0, based on 79935 high quality pairs for FR orientation

Quartiles (25 50 75) = 398 410 421

Mean = 410.151

Standard deviation = 14.6773

Boundaries for mean and standard deviation: low = 352, high = 467

Boundaries for proper pairs: low = 329, high = 490

NOTE: DRAGEN's insert estimates include corrections for clipping (so they are not identical to TLEN)
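If you need these values in a script, a report like the one above can be parsed with a few regular expressions. The following is a minimal sketch: parse_insert_report is a hypothetical helper, and its patterns are derived only from the sample report shown here, so other DRAGEN versions may word these lines differently.

```python
import re

# Patterns are based only on the sample report above (assumption: the
# wording of these lines is stable across runs).
_PATTERNS = {
    "quartiles": r"^\s*Quartiles \(25 50 75\) = (\d+) (\d+) (\d+)",
    "mean": r"^\s*Mean = ([0-9.]+)",
    "stddev": r"^\s*Standard deviation = ([0-9.]+)",
    "stat_bounds": r"^\s*Boundaries for mean and standard deviation: "
                   r"low = (\d+), high = (\d+)",
    "proper_pair_bounds": r"^\s*Boundaries for proper pairs: "
                          r"low = (\d+), high = (\d+)",
}

def parse_insert_report(text):
    """Extract paired-end statistics from a stdout report (hypothetical helper)."""
    stats = {}
    for key, pattern in _PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is None:
            continue  # leave missing fields out rather than guessing
        values = [float(v) if "." in v else int(v) for v in match.groups()]
        stats[key] = values[0] if len(values) == 1 else tuple(values)
    return stats
```

Fields that do not appear in the text are omitted from the returned dictionary, so callers can detect an incomplete report.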

The insert length distribution for each sample is written to fragment_length_hist.csv. The section for each sample starts with the following lines:

#Sample: sample name
FragmentLength,Count

These lines are followed by the histogram.
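The layout described above can be read with a short parser. This is a sketch under assumptions: it assumes each histogram row is an integer "length,count" pair, and the function names are hypothetical.

```python
def parse_fragment_hist(lines):
    """Return {sample_name: {fragment_length: count}} from the CSV lines.

    Assumes each sample section starts with "#Sample: <name>", then the
    "FragmentLength,Count" header, then integer "length,count" rows.
    """
    hists = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Sample:"):
            current = line[len("#Sample:"):].strip()
            hists[current] = {}
        elif line == "FragmentLength,Count":
            continue  # column header, carries no data
        elif current is not None:
            length, count = line.split(",")
            hists[current][int(length)] = int(count)
    return hists

def hist_mean(hist):
    """Mean fragment length of one histogram, weighted by count."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(length * count for length, count in hist.items()) / total
```

To use it, open the file and pass the file object, for example parse_fragment_hist(open("fragment_length_hist.csv")), adjusting the path to your output directory.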

When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:

WARNING: Less than 28 high quality pairs found - standard deviation is calculated from the small samples formula

The small samples formula calculates standard deviation as follows:

if samples < 3 then
    standard deviation = 10000
else if samples < 28 then
    standard deviation = 25 * (standard deviation + 1) / (samples - 2)
end if

if standard deviation < 12 then
    standard deviation = 12
end if

The default model is “standard deviation = 10000”. If the first 100,000 reads are unmapped or if all pairs are improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0. Note that the minimum value for standard deviation is 12, regardless of the number of samples.
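The small samples formula described above can be written as a short function. This is a direct transcription of the pseudocode, not DRAGEN source code; the function name adjust_stddev is hypothetical.

```python
def adjust_stddev(samples, stddev):
    """Apply the small samples formula to a measured standard deviation.

    samples: number of high quality pairs used for the estimate.
    stddev:  standard deviation measured from those pairs.
    """
    if samples < 3:
        # Too few pairs to estimate anything: use the default wide model.
        stddev = 10000.0
    elif samples < 28:
        # Inflate the estimate; the fewer the samples, the larger the factor.
        stddev = 25.0 * (stddev + 1.0) / (samples - 2)
    # The floor of 12 applies regardless of the number of samples.
    return max(stddev, 12.0)
```

For example, with 10 pairs and a measured standard deviation of 14.6773, the result is 25 * 15.6773 / 8, or about 49.0. With 28 or more pairs the measured value is used unchanged unless it falls below 12.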

For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.
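For readers unfamiliar with kernel density estimation, the following is a generic sketch of a Gaussian kernel density estimator. It only illustrates the technique: the kernel, bandwidth rule, and tail-fitting procedure that DRAGEN uses are not described in this section, and the bandwidth rule below (a normal reference rule of thumb) is an assumption chosen for the example.

```python
import math
import statistics

def gaussian_kde(samples, bandwidth=None):
    """Return a function estimating the density of the given insert sizes.

    Generic Gaussian KDE for illustration only. If no bandwidth is given,
    a normal reference rule of thumb is used (an assumption).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    if bandwidth is None:
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-0.2)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

Unlike a single Gaussian fit, the estimate keeps a small amount of density around outlying samples, such as a pair that spans an intron, instead of letting them inflate the standard deviation of the main peak.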

DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. This file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, minimum, and maximum. The information matches the standard-out report above. Additionally, the log file includes the minimum and maximum insert limits that DRAGEN applied for rescue scans.