Primary Seed Length
The --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of the same length from each read and looks for exact matches unless seed editing is enabled in the hash table.
The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16–64 GB, which covers typical sizes for a whole human genome, and k=26 for table sizes from 4–16 GB.
The minimum primary seed length depends mainly on the reference genome size and complexity. The seed length needs to be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound could be smaller for shorter genomes or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1 Gbp human genome can be understood intuitively because log4(3.1 G) ≈ 16, so it requires at least 16 choices from four nucleotides to distinguish 3.1 G reference positions.
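The log4 argument above can be checked with a few lines of Python. This is only a sketch of the counting intuition, not how the hash table builder chooses or validates a seed length. The function name and the 5 Mbp example genome are hypothetical, and the model ignores repeats, which push the practical lower bound higher.

```python
def min_unique_seed_len(genome_size_bp: int, alphabet_size: int = 4) -> int:
    """Smallest k such that alphabet_size**k >= genome_size_bp.

    Integer arithmetic is used instead of math.log to avoid
    floating-point rounding near exact powers of four.
    """
    k = 0
    distinct_seeds = 1  # number of distinct k-mers for the current k
    while distinct_seeds < genome_size_bp:
        distinct_seeds *= alphabet_size
        k += 1
    return k


if __name__ == "__main__":
    # 3.1 Gbp human genome, as quoted in the text above.
    print(min_unique_seed_len(3_100_000_000))
    # Hypothetical 5 Mbp genome, for comparison only.
    print(min_unique_seed_len(5_000_000))
```

For the 3.1 Gbp human genome the function returns 16, which agrees with the threshold quoted above.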

For read mapping to succeed, at least one primary seed must match the reference exactly, or with a single SNP when edited seeds are used. Shorter seeds are more likely to map successfully to the reference, because more seeds can fit in each read and each seed is less likely to overlap a variant or contain a sequencing error.
However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map to multiple reference positions and can lead the mapper to consider more false mapping locations. Because Smith-Waterman alignment scoring and other heuristics model mutations and errors imperfectly, these noise matches are occasionally reported. Run-time quality filters such as --Aligner.aln_min_score can control the accuracy issues with very short seeds.

Shorter seeds tend to slow down mapping because the seeds map to more reference locations, which results in additional work, such as using Smith-Waterman alignments to determine the best result. This effect is most pronounced when the primary seed length approaches the uniqueness threshold of the reference genome, e.g., k=16 for a whole human genome.
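A rough way to see why the slowdown is steepest near the uniqueness threshold is to estimate how many reference positions a seed would match by chance alone. The sketch below assumes a uniformly random reference, which is a simplification: real genomes are repetitive, so actual hit counts are higher. The function name is hypothetical and nothing here reflects the mapper's internal logic.

```python
def expected_random_hits(genome_size_bp: int, k: int) -> float:
    """Expected number of chance matches for one k-mer in a random genome.

    Assumes each of the 4**k possible seeds is equally likely at every
    reference position (an idealized model, not a property of real genomes).
    """
    return genome_size_bp / 4 ** k


if __name__ == "__main__":
    human_genome_bp = 3_100_000_000  # size quoted in the text above
    for k in (14, 16, 18, 21, 27):
        hits = expected_random_hits(human_genome_bp, k)
        print(f"k={k:2d}  expected chance hits per seed: {hits:.3g}")
```

Each nucleotide removed from the seed multiplies the expected number of chance hits by four, so under this model the extra alignment work grows quickly as k drops toward 16.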

• Read Length: Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatch positions from variants or sequencing errors can chop the read into only short segments that match the reference, so that only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, one SNP in the middle can block seeds longer than 18 bp from matching the reference. In a 250 bp read, 15 SNPs are required to exceed a 0.01% chance of blocking 27 bp seeds.
• Paired Ends: The use of paired end reads can improve mapping accuracy for longer reads. DRAGEN uses paired end information to improve mapping accuracy, including with rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired end reads have twice the opportunity for an exact matching seed to find their correct alignments.
• Variant or Error Rate: When read differences from the reference are more frequent, shorter seeds could be required to fit between the difference positions in a given read and match the reference exactly.
• Mapping Percentage Requirement: If the application requires a high percentage of reads to be mapped (even at low MAPQ), short seeds may be helpful. Some reads that do not match the reference well are more likely to map using short seeds to find partial matches to the reference.
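The 36 bp example under Read Length can be reproduced with a small sketch that finds the longest stretch of a read containing no mismatch. It considers exact-match seeds only (no seed editing), uses 0-based mismatch positions, and the function names are hypothetical rather than part of DRAGEN.

```python
def longest_exact_segment(read_len: int, mismatch_positions: list[int]) -> int:
    """Length of the longest run of read bases that contains no mismatch.

    mismatch_positions are 0-based offsets into the read.
    """
    longest = 0
    previous = -1  # sentinel just before the first base
    # Appending read_len as a sentinel closes the final segment.
    for pos in sorted(mismatch_positions) + [read_len]:
        longest = max(longest, pos - previous - 1)
        previous = pos
    return longest


def exact_seed_fits(read_len: int, mismatch_positions: list[int], k: int) -> bool:
    """True if at least one k-base seed can avoid every mismatch position."""
    return longest_exact_segment(read_len, mismatch_positions) >= k


if __name__ == "__main__":
    # One SNP in the middle of a 36 bp read (0-based position 18).
    print(longest_exact_segment(36, [18]))
    print(exact_seed_fits(36, [18], 18))
    print(exact_seed_fits(36, [18], 19))
```

With a single SNP at the midpoint, the longest mismatch-free segment is 18 bp, so an 18 bp seed can still match exactly while a 19 bp seed cannot.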