Primary Seed Length

For read mapping to succeed, at least one primary seed must match the reference exactly, or with a single SNP when edited seeds are used. Shorter seeds are more likely to map successfully to the reference, because more seeds fit in each read and each seed is less likely to overlap a variant or contain a sequencing error.

However, very short seeds can reduce mapping accuracy. They often map to multiple reference positions, which leads the mapper to consider more false mapping locations. Because Smith-Waterman alignment scoring and other heuristics model mutations and errors imperfectly, these noise matches are occasionally reported. Run-time quality filters such as --Aligner.aln_min_score can control the accuracy issues with very short seeds.
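The trade-off described above can be illustrated with a small calculation. This is a minimal sketch, not DRAGEN's seeding logic: it assumes a seed may start at every read offset and that differences from the reference occur independently at a fixed per-base rate. The function names and the 150 bp / 1% figures are hypothetical, chosen only for illustration.

```python
def seed_windows(read_len: int, seed_len: int) -> int:
    """Number of start offsets at which a seed of seed_len fits in the read."""
    return max(0, read_len - seed_len + 1)


def prob_seed_clean(seed_len: int, diff_rate: float) -> float:
    """Chance that a single seed contains no difference from the reference.

    Assumes each base differs independently with probability diff_rate,
    which is a simplification of real variant and error patterns.
    """
    return (1.0 - diff_rate) ** seed_len


# Compare a few seed lengths for a hypothetical 150 bp read at a 1% difference rate.
for k in (17, 21, 27):
    print(k, seed_windows(150, k), round(prob_seed_clean(k, 0.01), 3))
```

Under these assumptions, a shorter seed gives both more candidate windows per read and a higher chance that any one window is free of differences.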

Speed Considerations

Shorter seeds tend to slow down mapping because they map to more reference locations, which results in additional work, such as Smith-Waterman alignments to determine the best result. This effect is most pronounced when the primary seed length approaches the uniqueness threshold of the reference genome, for example, K=16 for the whole human genome.
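One way to see where a threshold such as K=16 comes from is to compare the number of possible K-mers (4^K) with the number of positions in the reference. The sketch below is a rough, idealized estimate: it treats the reference as a random sequence of about 3.1 billion bases (an assumed round figure for the human genome) and ignores repeats and the reverse strand, so real genomes produce more multi-mapping seeds than it suggests. The function names are hypothetical.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate size, for illustration only


def min_unique_seed_len(genome_bp: int) -> int:
    """Smallest K for which the number of possible K-mers (4**K) reaches genome_bp."""
    k = 1
    while 4 ** k < genome_bp:
        k += 1
    return k


def expected_random_hits(genome_bp: int, seed_len: int) -> float:
    """Expected chance matches of one seed in a random sequence of genome_bp bases."""
    return genome_bp / 4 ** seed_len


print(min_unique_seed_len(HUMAN_GENOME_BP))  # 16, since 4**15 < 3.1e9 <= 4**16
```

Calling `expected_random_hits` with seed lengths below this value shows how quickly the number of chance matches, and therefore the downstream alignment work, grows as seeds shrink.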

Application Considerations

• Read Length: Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatches from variants or sequencing errors can split the read into short matching segments, so only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, one SNP in the middle can block seeds longer than 18 bp from matching the reference. In a 250 bp read, 15 SNPs are required to exceed a 0.01% chance of blocking 27 bp seeds.

• Paired Ends: The use of paired-end reads can improve mapping accuracy for longer reads. DRAGEN uses paired-end information to improve mapping accuracy, including rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired-end reads have twice the opportunity for an exact matching seed to find their correct alignments.

• Variant or Error Rate: When reads differ from the reference more frequently, shorter seeds may be required to fit between the difference positions in a given read and match the reference exactly.

• Mapping Percentage Requirement: If the application requires a high percentage of reads to be mapped (even at low MAPQ), short seeds may be helpful. Reads that do not match the reference well are more likely to map when short seeds can find partial matches to the reference.
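The read-length and error-rate considerations above both come down to whether any stretch of the read between differences is long enough to hold a seed. The sketch below makes that concrete under stated assumptions: differences fall at distinct, uniformly random read positions, and a seed may start at any offset. It is not DRAGEN's method, and the function names are hypothetical. The count of blocked placements uses standard inclusion-exclusion over the gaps between differences.

```python
from fractions import Fraction
from math import comb


def longest_exact_run(read_len, diff_positions):
    """Length of the longest stretch of read bases containing no difference."""
    longest, prev = 0, -1
    # Append read_len as a sentinel so the final stretch is measured too.
    for pos in sorted(set(diff_positions)) + [read_len]:
        longest = max(longest, pos - prev - 1)
        prev = pos
    return longest


def prob_all_seeds_blocked(read_len, seed_len, n_diffs):
    """Probability that n_diffs differences leave no exact run of seed_len bases.

    Assumes the differences occupy distinct positions chosen uniformly at
    random. The n_diffs differences split the read into n_diffs + 1 gaps;
    every seed is blocked when each gap is shorter than seed_len.
    """
    if n_diffs > read_len:
        raise ValueError("more differences than read positions")
    total = comb(read_len, n_diffs)
    blocked = 0
    for j in range(n_diffs + 2):
        remaining = read_len - j * seed_len
        if remaining < n_diffs:
            break
        # Inclusion-exclusion over j gaps forced to be at least seed_len long.
        blocked += (-1) ** j * comb(n_diffs + 1, j) * comb(remaining, n_diffs)
    return Fraction(blocked, total)


# One SNP at the centre of a 36 bp read leaves at most 18 exact bases.
print(longest_exact_run(36, [18]))
# Chance that one randomly placed SNP blocks every 19 bp seed in a 36 bp read.
print(prob_all_seeds_blocked(36, 19, 1))
```

Calling `prob_all_seeds_blocked(250, 27, n)` for increasing `n` gives a way to explore how many differences a long read tolerates before longer seeds start to fail under this simplified model.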