Seed Extension Control

DRAGEN seed extension is dynamic and is applied only as needed, for particular K-mers that map to too many reference locations. Seeds are incrementally extended in steps of 2–14 bases (always even) from a primary seed length to a fully extended length. In each extension step, the bases are appended symmetrically, with an equal number added at each end of the seed. The appended bases then determine the next extension increment, if any.
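A single symmetric extension step can be sketched in Python as follows. This is a minimal illustration, not DRAGEN code: the function name extend_seed, the example read, and the (start, length) representation of a seed are assumptions made for the example.

```python
def extend_seed(read: str, start: int, length: int, step: int) -> tuple[int, int]:
    """Return (start, length) of a seed after one symmetric extension step.

    Hypothetical helper: half of `step` is added at each end of the seed.
    """
    if step % 2 != 0 or not 2 <= step <= 14:
        raise ValueError("step must be an even number from 2 to 14")
    half = step // 2
    new_start = start - half
    new_length = length + step
    if new_start < 0 or new_start + new_length > len(read):
        raise ValueError("extension would run past the end of the read")
    return new_start, new_length


read = "ACGTACGTACGTACGTACGTACGTACGTACGT"  # 32 bases, made up for the example
start, length = extend_seed(read, start=8, length=12, step=4)
seed = read[start:start + length]  # the 16-base extended seed
```

Here a 12-base seed starting at offset 8 grows by 4 bases, 2 at each end, into a 16-base seed starting at offset 6.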

There is a potentially complex seed extension tree associated with each high-frequency primary seed. Each full tree is generated during hash table construction, and a path from the root is traced by iterative extension steps during seed mapping. The hash table builder employs a dynamic programming algorithm to search the space of all possible seed extension trees for an optimal one, using a cost function that balances mapping accuracy and speed. The following options define that cost function:
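The idea of tracing a path from the root of an extension tree during seed mapping can be illustrated with a toy model. The sketch below is not the DRAGEN hash table format: it assumes a plain Python dictionary keyed by seed sequence, where each value is either a hypothetical ("EXTEND", step) record or a hypothetical ("HIT", positions) record, and it assumes symmetric extension with half of each step added at each end.

```python
def map_seed(read: str, start: int, primary_len: int, table: dict) -> list[int]:
    """Follow EXTEND records from the primary seed until HIT records are found.

    Toy model only: `table` maps a seed string to ("EXTEND", step) or
    ("HIT", [reference positions]).
    """
    length = primary_len
    while True:
        record = table.get(read[start:start + length])
        if record is None:
            return []  # seed not present in the toy table
        kind, payload = record
        if kind == "HIT":
            return payload  # leaf reached: reference positions
        # EXTEND: grow the seed symmetrically by `payload` bases and look up again
        start -= payload // 2
        length += payload
        if start < 0 or start + length > len(read):
            return []  # not enough read left to extend


# Made-up table: the 4-base seed is too frequent, so it points to a 2-base extension.
table = {
    "CGTA": ("EXTEND", 2),
    "ACGTAC": ("HIT", [1042]),
}
hits = map_seed("TTACGTACGG", start=3, primary_len=4, table=table)
```

In this example the primary seed CGTA resolves to an EXTEND record, the seed grows to ACGTAC, and the second lookup returns the single made-up reference position.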

--ht-target-seed-freq—Target Hit Frequency

The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extension should aim. Higher values lead to fewer and shorter final seed extensions because shorter seeds tend to match more reference positions.

--ht-cost-coeff-seed-len—Cost Coefficient for Seed Length

The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed is extended. Additional bases are considered a cost because longer seeds risk overlapping variants or sequencing errors and losing their correct mappings. Higher values lead to shorter final seed extensions.

--ht-cost-coeff-seed-freq—Cost Coefficient for Hit Frequency

The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between the target hit frequency and the number of hits populated for a single seed. Higher values result primarily in high-frequency seeds being extended further to bring their frequencies down toward the target.

--ht-cost-penalty—Cost Penalty for Seed Extension

The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. A higher value results in fewer seeds being extended at all. The default value is 0.

--ht-cost-penalty-incr—Cost Increment for Extension Step

The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension step taken from primary to final extended seed length. More steps are considered a higher cost because extending in many small steps requires more hash table space for intermediate EXTEND records and also takes substantially more run time to execute the extensions. A higher value results in seed extension trees with fewer nodes, which reach from the root primary seed length to leaf extended seed lengths in fewer, larger steps.
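To show how the five options above might interact, the following Python sketch combines them into a single cost for one primary-to-final extension path. The actual DRAGEN formula is not given here, so everything in the sketch is an assumption: the linear form, the use of an absolute difference for the frequency term, the class and function names, and all numeric values (which are not DRAGEN defaults).

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    """Hypothetical stand-ins for the --ht-* cost options."""
    target_seed_freq: float   # --ht-target-seed-freq
    coeff_seed_len: float     # --ht-cost-coeff-seed-len
    coeff_seed_freq: float    # --ht-cost-coeff-seed-freq
    penalty: float            # --ht-cost-penalty
    penalty_incr: float       # --ht-cost-penalty-incr


def extension_cost(p: CostParams, primary_len: int, final_len: int,
                   steps: int, hits: int) -> float:
    """Assumed linear cost of reaching `final_len` in `steps` extension steps."""
    extra_bases = final_len - primary_len
    cost = p.coeff_seed_len * extra_bases                       # per-base cost
    cost += p.coeff_seed_freq * abs(hits - p.target_seed_freq)  # distance from target
    if extra_bases > 0:
        cost += p.penalty                                       # flat cost for extending at all
    cost += p.penalty_incr * steps                              # recurring cost per step
    return cost


# Illustrative values only.
params = CostParams(target_seed_freq=10.0, coeff_seed_len=1.0,
                    coeff_seed_freq=0.5, penalty=0.0, penalty_incr=2.0)

unextended = extension_cost(params, primary_len=21, final_len=21, steps=0, hits=50)
one_step = extension_cost(params, primary_len=21, final_len=27, steps=1, hits=12)
three_steps = extension_cost(params, primary_len=21, final_len=27, steps=3, hits=12)
```

With these made-up numbers, leaving a 50-hit seed unextended costs 20.0, extending it by 6 bases in one step costs 9.0, and reaching the same length in three steps of 2 bases costs 13.0. Under this assumed model, a higher step increment cost therefore favors fewer, larger steps, in line with the option descriptions.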