DNA Aligning

Smith-Waterman Alignment Scoring Settings

The first stage of mapping generates seeds from the read and looks for exact matches in the reference genome. The seed match results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. The alignment algorithm compares each position of the read against all candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each candidate alignment position, the Smith-Waterman algorithm generates a score while passing through the scoring matrix. The score reflects whether the alignment is reached by a nucleotide match or mismatch, a deletion, or an insertion. A match between read and reference provides a bonus, and a mismatch or indel imposes a penalty. The alignment chosen is the highest-scoring path through the matrix.
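The scoring pass described above can be illustrated with a minimal software sketch of Smith-Waterman local alignment with affine gap penalties. This is not the DRAGEN hardware implementation. The function name and the default values (match 1, mismatch 4, gap open 6, gap extend 1) are assumptions chosen only for illustration, and the sketch returns only the best score, not the alignment path.

```python
def local_alignment_score(read, ref, match=1, mismatch=4, gap_open=6, gap_ext=1):
    """Best local alignment score of read against ref.

    A gap of length L costs gap_open + L * gap_ext (illustrative values only).
    """
    neg_inf = float("-inf")
    cols = len(ref) + 1
    h_prev = [0] * cols        # best score ending at each cell of the previous row
    f_prev = [neg_inf] * cols  # best score ending in an insertion, previous row
    best = 0
    for i in range(1, len(read) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # best score ending in a deletion, current row
        for j in range(1, cols):
            # Deletion: a reference base is consumed without a read base.
            e = max(e - gap_ext, h_cur[j - 1] - gap_open - gap_ext)
            # Insertion: a read base is consumed without a reference base.
            f_cur[j] = max(f_prev[j] - gap_ext, h_prev[j] - gap_open - gap_ext)
            # Match or mismatch along the diagonal.
            step = match if read[i - 1] == ref[j - 1] else -mismatch
            diag = h_prev[j - 1] + step
            # The floor of 0 is what permits clipping in local alignment.
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# A mismatch in the middle of the read: with a penalty of 4 the best path
# clips after the first four bases, with a penalty of 2 it aligns through.
print(local_alignment_score("ACGTTGCA", "ACGTAGCA"))
print(local_alignment_score("ACGTTGCA", "ACGTAGCA", mismatch=2))
```

Changing the mismatch penalty in the last two calls shows how the same read can be clipped or aligned through a SNP depending on the scoring values.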

For an alignment with multiple possible interpretations, the specific values chosen for scores indicate how to balance the possibility of an indel as opposed to one or more SNPs or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. However, any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors. Differently tuned alignment scoring values can be more appropriate for some applications.

The following options control Smith-Waterman Alignment:

Option

Description

--Aligner.global

The global option controls whether alignment is forced to be end-to-end in the read. The available values are 0 or 1.

• When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm. Alignments are not end-to-end in the reference. Alignment scores can be positive or negative.

• When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are positive.

For long reads, the value 0 is preferred, so significant read segments after a break can be clipped without severely decreasing the alignment score. Examples of breaks include a large indel, structural variant, chimeric read, and so forth. Setting the option to 1 might not have the desired effect with longer reads, because insertions at or near the ends of a read can function as pseudoclipping. Also, when global is 0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions.

For short reads, setting global to 1 is sometimes preferable. Short reads are unlikely to overlap structural breaks, cannot support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end.

To make a soft preference for unclipped alignments, consider using the unclip-score option, or increasing it, instead of setting global to 1.

--Aligner.match-score

The match-score option specifies the score for a read nucleotide matching a reference nucleotide (A, C, G, or T), or matching a reference 2–3 nucleotide IUPAC-IUB code. The value is an unsigned integer from 0 to 15. Only set the match-score option to 0 when global is 1. A higher match score results in longer alignments and fewer long insertions.

--Aligner.match-n-score

The match-n-score option specifies the score for an aligned position where the read position and/or the reference position is an N code. The option is a signed integer from -16 to 15.

--Aligner.mismatch-pen

The mismatch-pen option sets the penalty, or negative score, for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code, except N. The option is an unsigned integer from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.

--Aligner.gap-open-pen

The gap-open-pen option sets the penalty, or negative score, for opening a gap (ie, an insertion or deletion). The value is only for a 0-base gap. The penalty is always added to the gap length multiplied by gap-ext-pen. The option is an unsigned integer from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs. Clipping or alignment through SNPs is used instead.

--Aligner.gap-ext-pen

The gap-ext-pen option sets the penalty, or negative score, for extending a gap (ie, an insertion or deletion) by one base. This option is an unsigned integer from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs. Short indels, clipping, or alignment through SNPs is used instead.
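The way gap-open-pen and gap-ext-pen combine can be shown with a short calculation. The following is a sketch only. The function name and the penalty values are hypothetical and are not DRAGEN defaults.

```python
def gap_penalty(length, gap_open_pen, gap_ext_pen):
    """Total penalty for one gap: the open penalty plus length * extension penalty."""
    if length < 1:
        raise ValueError("a gap contains at least one base")
    return gap_open_pen + length * gap_ext_pen


# Hypothetical settings, chosen only for illustration.
GAP_OPEN, GAP_EXT = 6, 1

one_long_gap = gap_penalty(6, GAP_OPEN, GAP_EXT)        # one 6-base indel
two_short_gaps = 2 * gap_penalty(3, GAP_OPEN, GAP_EXT)  # two 3-base indels

# The open penalty is paid once per gap, so splitting the same number of
# gap bases across two gaps costs more than one contiguous gap.
print(one_long_gap, two_short_gaps)
```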

--Aligner.unclip-score

The unclip-score option sets the score bonus for an alignment reaching the beginning or end of the read. A higher unclipped bonus causes alignments to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels. An end-to-end alignment receives twice the bonus. The option is an unsigned integer from 0 to 127.

A nonzero unclip-score is useful when global is 0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global is 1, because end-to-end alignments are forced. However, 2 × unclip-score does add to every alignment score (unless no-unclip-score is 1).

For longer reads, setting unclip-score much higher than gap-open-pen can result in insertions at or near one end of a read being used as pseudoclipping, as happens when global is 1.
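The soft preference created by unclip-score can be worked through for a single read. The sketch below assumes a hypothetical 50-base read whose only mismatch is at base 48, a match score of 1, and a mismatch penalty of 4. None of these values are DRAGEN defaults, and the function names are illustrative.

```python
def alignment_score(matches, mismatches, ends_reached, unclip_score,
                    match_score=1, mismatch_pen=4):
    """Gap-free alignment score, with one unclipped bonus per read end reached."""
    return (matches * match_score
            - mismatches * mismatch_pen
            + ends_reached * unclip_score)


def prefers_unclipped(unclip_score):
    """Compare clipping the last 3 bases with aligning end-to-end."""
    # Clipped: 47 matching bases, reaches only the start of the read.
    clipped = alignment_score(47, 0, 1, unclip_score)
    # Unclipped: 49 matches and 1 mismatch, reaches both ends of the read.
    unclipped = alignment_score(49, 1, 2, unclip_score)
    return unclipped > clipped


for bonus in (0, 2, 5):
    print(bonus, prefers_unclipped(bonus))
```

In this example the end-to-end alignment wins only when the bonus exceeds 2, which is the score lost by aligning through the mismatch instead of clipping.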

--Aligner.no-unclip-score

The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing. Further processing includes comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by the Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments.

If an unclip-score greater than 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the following score changes are possible:

• If no-unclip-score is 0, the alignment score stays the same or increases.

• If no-unclip-score is 1, the alignment score stays the same or decreases.

The default (no-unclip-score set to 1) is recommended when global is 1, because every alignment is end-to-end. There is no need to add the same bonus to every alignment.

When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score is 0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.
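The interaction between no-unclip-score and aln-min-score can be sketched as follows. The function names and all numeric values are hypothetical. The example assumes an alignment that reaches both read ends with a raw score of 55, of which 10 comes from two unclipped bonuses of 5.

```python
def reported_score(raw_score, ends_reached, unclip_score, no_unclip_score):
    """Score used for later processing, such as the aln-min-score comparison."""
    if no_unclip_score == 1:
        # Remove the unclipped bonuses that contributed to the raw score.
        return raw_score - ends_reached * unclip_score
    return raw_score


def passes_min_score(score, aln_min_score):
    """Alignments scoring lower than aln-min-score are discarded."""
    return score >= aln_min_score


RAW, ENDS, UNCLIP, ALN_MIN = 55, 2, 5, 50  # illustrative values only

for setting in (0, 1):
    score = reported_score(RAW, ENDS, UNCLIP, setting)
    print(setting, score, passes_min_score(score, ALN_MIN))
```

With these values the same alignment passes the floor when no-unclip-score is 0 and is filtered out when it is 1, which is why aln-min-score may need to be adjusted together with no-unclip-score.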

--Aligner.aln-min-score

The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing aln-min-score can reduce the percentage of reads mapped, and decreasing it can increase that percentage. This option is a signed integer (negative alignment scores are possible with global = 1). aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. A read's best alignment score is saved in the AS SAM tag, and the second-best score (if available) is saved in the XS tag. If no alignment other than the best one scores higher than aln-min-score, aln-min-score serves as the suboptimal alignment score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments. You can use the min-score-coeff option to adjust aln-min-score as a function of read length.

--Aligner.min-score-coeff

The min-score-coeff option makes adjustments to aln-min-score per read base. When using the min-score-coeff and aln-min-score options together, you can define the minimum alignment score for each read as an affine function of read length. The minimum score for an N-base read is calculated as follows.

(min-score-coeff)*N+(aln-min-score)

The min-score-coeff option is a signed real number ranging from -64 to 63.999. If the value is 0, then the minimum alignment score is fixed at aln-min-score for all read lengths. You can use positive values for min-score-coeff to allow shorter reads to match with lower alignment scores, but require longer reads to achieve higher scores.
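The affine relationship between read length and minimum score can be checked with a short calculation. The function name and the example values are hypothetical and are not DRAGEN defaults.

```python
def min_alignment_score(read_length, aln_min_score, min_score_coeff=0.0):
    """Minimum acceptable score for a read: min-score-coeff * N + aln-min-score."""
    return min_score_coeff * read_length + aln_min_score


# With a coefficient of 0, the floor is the same for every read length.
print(min_alignment_score(100, 22), min_alignment_score(250, 22))

# With a positive coefficient, longer reads must reach a higher score.
print(min_alignment_score(100, 10, 0.5), min_alignment_score(250, 10, 0.5))
```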

Paired-End Options

DRAGEN can process paired-end data passed via a pair of FASTQ files or in a single interleaved FASTQ file. The hardware maps the two ends separately, and then determines a set of alignments that seem most likely to form a pair in the expected orientation and having roughly the expected insert size. The alignments for the two ends are evaluated for the quality of their pairing, with larger penalties for insert sizes far from the expected size. The following options control processing of paired-end data:

Option

Description

--Aligner.pe-orientation

The pe-orientation option specifies the expected paired-end orientation. Only pairs with the orientation can be flagged as proper pairs. The following values are valid:

• 0 is FR (default)

• 1 is RF

• 2 is FF

--Aligner.unpaired-pen

For paired-end reads, the best mapping positions are determined jointly for each pair. The positions are evaluated according to the largest pair score found, considering the various combinations of alignments for each mate. A pair score is the sum of the two alignment scores minus a pairing penalty. The pairing penalty estimates how unlikely it is to observe an insert length at least as far from the mean insert as that of this aligned pair.

The unpaired-pen option specifies how much alignment pair scores should be penalized when the two alignments are not in properly paired position or orientation. The option also serves as the maximum pairing penalty for properly paired alignments with extreme insert lengths.

The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ. Internally, the penalty is scaled into the alignment score matrix based on Smith-Waterman scoring parameters.
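The role of unpaired-pen as both the improper-pair penalty and the cap on the pairing penalty can be sketched as follows. This sketch assumes that all penalties have already been converted into alignment score units. The Phred-to-score scaling and the insert-length penalty model are internal to DRAGEN and are not reproduced here. The function and parameter names are hypothetical.

```python
def pair_score(score_1, score_2, pairing_penalty, properly_paired, unpaired_pen):
    """Sum of the two alignment scores minus the applicable pairing penalty.

    All penalties are assumed to be in alignment score units already.
    """
    if properly_paired:
        # Extreme insert lengths are penalized by at most unpaired_pen.
        penalty = min(pairing_penalty, unpaired_pen)
    else:
        # Wrong position or orientation: the full unpaired penalty applies.
        penalty = unpaired_pen
    return score_1 + score_2 - penalty


print(pair_score(140, 150, 3, True, 20))    # typical insert length
print(pair_score(140, 150, 35, True, 20))   # extreme insert length, capped
print(pair_score(140, 150, 0, False, 20))   # not properly paired
```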

--Aligner.pe-max-penalty

The pe-max-penalty option limits how much the estimated MAPQ for one read can increase because its mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that it would have received through single-end mapping, plus the value of pe-max-penalty.

By default, pe-max-penalty = mapq-max = 255, effectively disabling this limit.

The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculated pair scores and thus which alignments are selected. The pe-max-penalty option affects only reported MAPQ for paired alignments.
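The MAPQ limit imposed by pe-max-penalty can be expressed as a simple cap. The sketch below assumes that the paired and single-end MAPQ estimates are already available. How DRAGEN computes those estimates is not shown, and the function name and input values are hypothetical.

```python
def capped_paired_mapq(paired_mapq, single_end_mapq, pe_max_penalty):
    """Limit the paired MAPQ to the single-end MAPQ plus pe-max-penalty."""
    return min(paired_mapq, single_end_mapq + pe_max_penalty)


# A read that maps ambiguously on its own (MAPQ 5) but pairs well (MAPQ 60).
print(capped_paired_mapq(60, 5, 255))  # a very high limit leaves MAPQ unchanged
print(capped_paired_mapq(60, 5, 20))   # a lower limit restricts the increase
```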

Mean Insert Size Detection

When working with paired-end data, DRAGEN chooses likely pairs from the highest-quality alignments for the two ends. To make this choice, DRAGEN uses a Gaussian statistical model to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on the intuition that a particular library prep tends to create fragments of roughly similar size, thus producing pairs whose insert lengths cluster well around some mean insert length.

If you know the statistics of the library prep for an input file and the file consists of a single read group, you can specify the following characteristics of the insert-length distribution:

• Mean

• Standard deviation

• Three quartiles

These characteristics can be specified with the pe-stat-mean-insert, pe-stat-stddev-insert, pe-stat-quartiles-insert, and pe-stat-mean-read-len options. However, allowing DRAGEN to detect these characteristics automatically is typically preferred.

To enable automatic sampling of the insert-length distribution, set --enable-sampling to true. When the software starts execution, it runs a sample of up to 100,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.

The DRAGEN host software reports the statistics in its stdout log, as follows:

Final paired-end statistics detected for read group 0, based on 79935 high quality pairs for FR orientation

Quartiles (25 50 75) = 398 410 421

Mean = 410.151

Standard deviation = 14.6773

Boundaries for mean and standard deviation: low = 352, high = 467

Boundaries for proper pairs: low = 329, high = 490

NOTE: DRAGEN's insert estimates include corrections for clipping (so they are not identical to TLEN)
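The boundaries in the example log above are consistent with limits placed a fixed multiple of the interquartile range outside the first and third quartiles. This relationship is inferred from this one example and is not stated as the DRAGEN formula, so treat the following sketch, including the function name and the multipliers 2 and 3, as an assumption.

```python
def quartile_boundaries(q25, q75, multiplier):
    """Limits placed multiplier * IQR below Q25 and above Q75 (assumed rule)."""
    iqr = q75 - q25
    return q25 - multiplier * iqr, q75 + multiplier * iqr


# Quartiles taken from the example log: 398 410 421.
print(quartile_boundaries(398, 421, 2))  # compare with low = 352, high = 467
print(quartile_boundaries(398, 421, 3))  # compare with low = 329, high = 490
```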

The insert length distribution for each sample is written to fragment_length_hist.csv. Each sample starts with the following lines:

#Sample: sample name
FragmentLength,Count

The lines are followed by the histogram.
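A file with this layout can be read with a few lines of Python. The sketch below assumes that each histogram row has the form length,count, matching the FragmentLength,Count header. The sample names and counts are invented for illustration, and the function name is hypothetical.

```python
def parse_fragment_length_hist(text):
    """Return {sample name: {fragment length: count}} from the CSV text."""
    histograms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#Sample:"):
            # A new sample section begins.
            current = line.split(":", 1)[1].strip()
            histograms[current] = {}
        elif line.startswith("FragmentLength"):
            continue  # column header
        elif current is not None:
            length, count = line.split(",")
            histograms[current][int(length)] = int(count)
    return histograms


# Invented example content, not real DRAGEN output.
example = """#Sample: sampleA
FragmentLength,Count
398,10
410,25
421,9
#Sample: sampleB
FragmentLength,Count
300,4
"""

hist = parse_fragment_length_hist(example)
print(sum(hist["sampleA"].values()))
```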

When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows.

WARNING: Less than 28 high quality pairs found - standard deviation is calculated from the small samples formula

The small samples formula calculates standard deviation as follows:

if samples < 3 then
    standard deviation = 10000
else if samples < 28 then
    standard deviation = 25 * (standard deviation + 1) / (samples - 2)
end if
if standard deviation < 12 then
    standard deviation = 12
end if

The default model is standard deviation = 10000. If the first 100,000 reads are unmapped or all pairs are improper pairs, then the standard deviation is set to 10,000 and the mean and quartiles are set to 0. The minimum value for standard deviation is 12, which is independent of the number of samples.
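The small samples formula can be written as a short Python function. This is a direct transcription of the pseudocode above, not DRAGEN source code. The function and parameter names are hypothetical.

```python
def adjusted_stddev(samples, measured_stddev):
    """Apply the small samples formula and the minimum of 12."""
    if samples < 3:
        stddev = 10000.0
    elif samples < 28:
        stddev = 25.0 * (measured_stddev + 1) / (samples - 2)
    else:
        stddev = measured_stddev
    # The floor of 12 applies regardless of the number of samples.
    return max(stddev, 12.0)


print(adjusted_stddev(2, 5.0))    # too few pairs: default wide model
print(adjusted_stddev(12, 15.0))  # small sample: inflated estimate
print(adjusted_stddev(27, 10.0))  # result below 12 is raised to 12
```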

For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. The estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.

DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. The file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, minimum, and maximum. The information matches the standard output report. The log file also includes the minimum and maximum insert limits that DRAGEN applied for rescue scans.

Output Options

DRAGEN can track multiple independent alignments for each read. The alignments include the optimal (primary) one, alignments mapping different subsegments of the read (chimeric/supplementary), and suboptimal (secondary) mappings of the read to different areas of the reference.

For DNA alignment by default, DRAGEN can emit one primary alignment for each read, up to three chimeric alignments (Aligner.supp-aligns=3), and no secondary alignments (Aligner.sec-aligns=0). The maximum user-specified value for supp-aligns or sec-aligns is 4095.

The following configuration options control how many of each type of alignment to include in the DRAGEN output.

Option

Description

--Aligner.mapq-max

The mapq-max option specifies a limit on the estimated MAPQ that can be reported for any alignment. Values from 0 to 255 are valid. If the calculated MAPQ is higher, the mapq-max value is reported instead. The default is 60.

--Aligner.supp-aligns

--Aligner.sec-aligns

The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimeric and SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100) alignments, respectively, that can be reported for each read. A maximum of 4095 supplementary alignments and 4095 secondary alignments can be reported for any read, in addition to a primary alignment. High settings for these two options impact speed so it is advisable to increase only as needed.

--Aligner.sec-phred-delta

The sec-phred-delta option controls the secondary alignments that are emitted based on the alignment score relative to the primary reported alignment. Only secondary alignments within this Phred value of the primary are reported.

--Aligner.sec-aligns-hard

The sec-aligns-hard option suppresses the output of all secondary alignments if there are more secondary alignments than can be emitted. When not all secondary alignments can be output, set sec-aligns-hard to 1 to force the read to be unmapped.

--Aligner.supp-as-sec

When the supp-as-sec option is set to 1, then supplementary (chimeric) alignments are reported with SAM FLAG 0x100 instead of 0x800. The default value is 0. The supp-as-sec option provides compatibility with tools that do not support FLAG 0x800.
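Downstream tools that inspect SAM FLAG values see the effect of supp-as-sec directly. The following sketch classifies an alignment from its FLAG using the two bits named above. The function names are hypothetical.

```python
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def supplementary_flag_bit(supp_as_sec):
    """FLAG bit used for supplementary (chimeric) alignments."""
    return FLAG_SECONDARY if supp_as_sec == 1 else FLAG_SUPPLEMENTARY


def alignment_kind(flag):
    """Classify a SAM FLAG value as primary, secondary, or supplementary."""
    if flag & FLAG_SUPPLEMENTARY:
        return "supplementary"
    if flag & FLAG_SECONDARY:
        return "secondary"
    return "primary"


# With supp-as-sec set to 1, a chimeric alignment is seen as secondary.
print(alignment_kind(supplementary_flag_bit(0)))
print(alignment_kind(supplementary_flag_bit(1)))
```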

--Aligner.hard-clips

The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specify alignments, as follows.

• Bit 0 is primary alignments

• Bit 1 is supplementary alignments

• Bit 2 is secondary alignments

Each bit determines whether local alignments of that type are reported with hard clipping (1) or soft clipping (0). The default value is 6, meaning primary alignments use soft clipping and supplementary and secondary alignments use hard clipping.
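The bit field can be decoded with simple masking. The sketch below only interprets a hard-clips value according to the bit assignments listed above. The function name is hypothetical.

```python
def decode_hard_clips(value):
    """Map a hard-clips value (0 to 7) to the clipping used per alignment type."""
    if not 0 <= value <= 7:
        raise ValueError("hard-clips is a 3-bit field with values 0 to 7")
    bits = {"primary": 0, "supplementary": 1, "secondary": 2}
    return {
        kind: "hard" if value & (1 << bit) else "soft"
        for kind, bit in bits.items()
    }


# The default value 6 is binary 110: bits 1 and 2 are set, bit 0 is clear.
print(decode_hard_clips(6))
```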