Primary Seed Length

The --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of the same length from each read and looks for exact matches unless seed editing is enabled in the hash table.
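The seed-and-lookup idea described above can be illustrated with a toy sketch. It is not the mapper's actual implementation: the function names `build_seed_table` and `lookup_read_seeds` are hypothetical, a plain Python dictionary stands in for the hash table, every reference position is indexed, and only exact matches are handled (no seed editing).

```python
from collections import defaultdict

def build_seed_table(reference: str, k: int) -> dict:
    """Index every k-length seed of the reference by its start position (toy model)."""
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        table[reference[pos:pos + k]].append(pos)
    return table

def lookup_read_seeds(read: str, table: dict, k: int) -> list:
    """Extract seeds of the same length k from the read and collect exact matches.

    Returns (read_offset, reference_position) pairs.
    """
    hits = []
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for ref_pos in table.get(seed, []):
            hits.append((offset, ref_pos))
    return hits

# Tiny made-up reference and read, with k=4 standing in for the seed length.
reference = "ACGTACGGTCAAGT"
table = build_seed_table(reference, 4)
hits = lookup_read_seeds("GGTCAA", table, 4)
print(hits)
```

In this toy case each read seed matches exactly one reference position, which is the situation a sufficiently long seed length is meant to produce for most of the genome.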

The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16–64 GB, which covers typical sizes for a whole human genome, and k=26 for sizes from 4–16 GB.

The minimum primary seed length depends mainly on the reference genome size and complexity. The seed must be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound could be smaller for shorter genomes or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1 Gbp human genome can be understood intuitively: log4(3.1 G) ≈ 16, so a seed needs at least 16 bases, each chosen from four nucleotides, to distinguish 3.1 G reference positions.
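The log4 argument can be checked with a few lines of arithmetic. The sketch below computes only the counting lower bound, that is, the smallest k for which 4^k is at least the number of reference positions. The function name `min_unique_seed_len` is hypothetical, and the result ignores repeat content, so the practical minimum for a real reference may be higher.

```python
import math

def min_unique_seed_len(genome_size_bp: int) -> int:
    """Smallest k such that 4**k >= genome_size_bp (counting bound only)."""
    k = 0
    while 4 ** k < genome_size_bp:
        k += 1
    return k

human_bp = 3_100_000_000          # approximate human genome size from the text
print(math.log(human_bp, 4))      # slightly under 16
print(min_unique_seed_len(human_bp))

small_bp = 5_000_000              # hypothetical 5 Mbp genome, for comparison
print(min_unique_seed_len(small_bp))
```

For the 3.1 Gbp case, 4^15 is about 1.07 G, which is too few distinct seeds, while 4^16 is about 4.29 G, which is enough, so the bound comes out at 16. The hypothetical 5 Mbp genome gives a bound of 12, consistent with the remark that shorter genomes can tolerate shorter seeds.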