Hash Table / Seed Extensions
Due to repetitive sequences, some seeds of any given length can match more than one location in the reference genome. DRAGEN uses a mechanism called seed extension to map such high-frequency seeds successfully. When DRAGEN determines that a primary seed occurs at more than one reference location, it extends the seed by some number of bases at both ends, so that the longer seed is more nearly unique in the reference.
For example, a 21-base primary seed could be extended by seven bases at each end to a 35-base extended seed. A 21-base primary seed could match 100 places in the reference. 35-base extensions of these 100 seed positions could divide into 40 groups of 1–3 identical 35-base seeds. DRAGEN supports iterative seed extensions, which are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.
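The grouping effect described above can be illustrated with a toy sketch. This is not DRAGEN's implementation: the function names, the 6-base seed, and the 30-base reference string are all hypothetical and chosen only to keep the example small. The sketch finds every occurrence of a primary seed, extends each occurrence by a fixed number of bases at both ends, and groups the positions by the resulting extended sequence.

```python
from collections import defaultdict


def seed_positions(reference, seed):
    """Return every start index at which `seed` occurs in `reference`."""
    positions = []
    start = reference.find(seed)
    while start != -1:
        positions.append(start)
        start = reference.find(seed, start + 1)
    return positions


def group_by_extension(reference, positions, seed_len, ext_per_side):
    """Group seed positions by the sequence obtained after extending the
    seed by `ext_per_side` bases at both ends (toy model, hypothetical)."""
    groups = defaultdict(list)
    for pos in positions:
        left = pos - ext_per_side
        right = pos + seed_len + ext_per_side
        if left < 0 or right > len(reference):
            continue  # extension would run off the reference; skipped here
        groups[reference[left:right]].append(pos)
    return dict(groups)


# Toy data: a 6-base seed that occurs three times with differing flanks.
SEED = "ACGTAC"
REFERENCE = "TTACGTACGG" + "CAACGTACGG" + "TTACGTACGG"

hits = seed_positions(REFERENCE, SEED)
extended = group_by_extension(REFERENCE, hits, len(SEED), ext_per_side=2)

for sequence, where in extended.items():
    print(sequence, where)
```

In this toy case the three primary seed hits split into two groups after a 2-base extension at each end: one extended seed that still occurs twice and one that is unique. A further, longer extension could only resolve the remaining pair if their flanking bases eventually differ, which is the situation iterative extension addresses.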
By default, the maximum extended seed length is equal to the primary seed length plus 128. To modify the maximum extended seed length, use the --ht-max-ext-seed-len option. For example, for short reads, the recommendation is to set the maximum extended seed length shorter than the read length, because an extended seed longer than the whole read can never match.
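The arithmetic behind this recommendation can be sketched as follows. The helper names are hypothetical, and the choice of `read_len - 1` is only one reading of "shorter than the read length"; the sketch does not reflect how DRAGEN itself selects a value.

```python
# From the text: the default maximum is the primary seed length plus 128.
DEFAULT_EXTENSION_ALLOWANCE = 128


def default_max_ext_seed_len(primary_seed_len):
    """Default maximum extended seed length as described in the text."""
    return primary_seed_len + DEFAULT_EXTENSION_ALLOWANCE


def capped_max_ext_seed_len(primary_seed_len, read_len):
    """Suggest a value for --ht-max-ext-seed-len that stays below the read
    length but never drops below the primary seed length (assumption)."""
    default = default_max_ext_seed_len(primary_seed_len)
    return max(primary_seed_len, min(default, read_len - 1))


print(default_max_ext_seed_len(21))      # default for a 21-base primary seed
print(capped_max_ext_seed_len(21, 100))  # suggestion for 100-base reads
print(capped_max_ext_seed_len(21, 250))  # default already fits 250-base reads
```

With a 21-base primary seed the default maximum is 149 bases, which exceeds a 100-base read, so a smaller value would be passed to --ht-max-ext-seed-len in that case.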
You can also tune how aggressively seeds are extended using the following options. These options are for advanced usage only.
• --ht-cost-coeff-seed-len
• --ht-cost-coeff-seed-freq
• --ht-cost-penalty
• --ht-cost-penalty-incr
There is a tradeoff between extension length and hit frequency. You can achieve faster mapping by using longer seed extensions to reduce seed hit frequencies. You can achieve more accurate mapping by avoiding seed extensions or keeping extensions short, while tolerating the resulting higher hit frequencies. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs and by finding more candidate-mapping locations at which to score alignments. The default extension settings, along with default seed frequency settings, lean aggressively toward mapping accuracy with relatively short seed extensions and high hit frequencies.
The defaults for the seed extension cost options and the seed frequency options are as follows.
| Option | Default |
|---|---|
| --ht-cost-coeff-seed-len | 1 |
| --ht-cost-coeff-seed-freq | 0.5 |
| --ht-cost-penalty | 0 |
| --ht-cost-penalty-incr | 0.7 |
| --ht-max-seed-freq | 16 |
| --ht-target-seed-freq | 4 |