The Variant Caller Algorithm

The DRAGEN Small Variant Caller performs the following steps:

• Active Region Identification: Identifies areas where multiple reads disagree with the reference and selects windows around them (active regions) for processing.

• Localized Haplotype Assembly: Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph built from the overlapping K-mers (length-K subsequences) that occur in one or more reads. When all reads are identical, the DBG is linear. Where the reads differ, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, the additional values K=35, 45, 55, and 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, that is, hypotheses for what the true DNA sequence might be on at least one strand.

• Haplotype Alignment: Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.

• Read Likelihood Calculation: Tests each read against each haplotype to estimate the probability of observing the read, assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various ways PCR or sequencing errors might have modified the haplotype into the observed read. The HMM evaluation uses a dynamic programming method to calculate the total probability over every series of Markov state transitions that arrives at the observed read.

• Genotyping: Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. These calculations use the constituent probabilities of observing each read given each haplotype, as computed by the pair HMM evaluation. The results feed into the Bayesian formula to calculate the likelihood that each genotype is the true genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
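The Localized Haplotype Assembly step can be illustrated with a small Python sketch. It is not DRAGEN's implementation: the function names are hypothetical, the reference is simply added to the reads so that the graph has a defined start and end, and the K values are assumed to be tried one after another in increasing order until the graph has no cycle.

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Build a de Bruijn graph: nodes are k-mers, edges join k-mers adjacent in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

def has_cycle(graph):
    """Detect a cycle with a depth-first search (recursive, so only for small graphs)."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))

def enumerate_haplotypes(graph, source, sink):
    """Spell out every path from the source k-mer to the sink k-mer."""
    haplotypes = []
    stack = [(source, source)]
    while stack:
        node, seq = stack.pop()
        if node == sink:
            haplotypes.append(seq)
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))  # each edge adds one base
    return sorted(haplotypes)

def assemble(reads, reference, k_values=(10, 25, 35, 45, 55, 65)):
    """Try each K in turn (an assumption) and return the first cycle-free result."""
    for k in k_values:
        graph = build_dbg(list(reads) + [reference], k)
        if not has_cycle(graph):
            return k, enumerate_haplotypes(graph, reference[:k], reference[-k:])
    return None, []  # no tested K gave a valid graph

# Toy example with tiny K values: K=2 produces a cycle, K=4 does not.
k, haplotypes = assemble(["TACGGTTCCAGT"], "TACGGATCCAGT", k_values=(2, 4))
print(k, haplotypes)
```

In the toy example the read differs from the reference by one base, so the cycle-free graph contains a single bubble and the two extracted paths are the reference haplotype and the variant haplotype.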
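The Read Likelihood Calculation step can be sketched as a forward algorithm over a three-state pair HMM (match, insertion, deletion). The transition scheme, the fixed error rates, the 0.25 emission probability for inserted bases, and the uniform start position along the haplotype are all assumptions chosen for illustration; the source does not state the parameters DRAGEN uses.

```python
def read_likelihood(read, haplotype, base_error=0.01, gap_open=0.001, gap_extend=0.1):
    """Forward algorithm: P(read | haplotype), summed over all alignments."""
    n, m = len(read), len(haplotype)
    t_match_match = 1.0 - 2.0 * gap_open  # stay in the match state
    t_gap_match = 1.0 - gap_extend        # leave an insertion or deletion

    # M, I, D hold forward probabilities for the match, insertion, deletion states.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    I = [[0.0] * (m + 1) for _ in range(n + 1)]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Assumption: the read may start at any haplotype position with equal probability.
    for j in range(m):
        D[0][j] = 1.0 / m

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if read[i - 1] == haplotype[j - 1]:
                emit = 1.0 - base_error
            else:
                emit = base_error / 3.0
            M[i][j] = emit * (t_match_match * M[i - 1][j - 1]
                              + t_gap_match * (I[i - 1][j - 1] + D[i - 1][j - 1]))
            # Insertion: the read has an extra base relative to the haplotype.
            I[i][j] = 0.25 * (gap_open * M[i - 1][j] + gap_extend * I[i - 1][j])
            # Deletion: the read skips a haplotype base.
            D[i][j] = gap_open * M[i][j - 1] + gap_extend * D[i][j - 1]

    # The read may end at any haplotype position.
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))

matching = read_likelihood("ACGT", "TTACGTTT")
mismatching = read_likelihood("ACGT", "TTACCTTT")
print(matching > mismatching)
```

A production implementation would typically take per-base quality scores instead of a single `base_error`, and would work in log space or with scaling to avoid underflow on long reads.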
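The Genotyping step can be sketched as follows, under simplifying assumptions: each read already has a likelihood for each allele (in practice these would come from the pair HMM evaluation), each read is assumed to originate from either chromosome copy with probability 0.5, and the genotype prior is flat unless one is supplied. The function name and input layout are hypothetical.

```python
import math
from itertools import combinations_with_replacement

def genotype_posteriors(read_likelihoods, alleles, priors=None):
    """Apply Bayes' rule to diploid genotypes.

    read_likelihoods: one dict per read, mapping allele -> P(read | allele).
    Likelihoods must be strictly positive, because logarithms are taken.
    """
    genotypes = list(combinations_with_replacement(alleles, 2))
    log_post = {}
    for genotype in genotypes:
        a, b = genotype
        # P(pileup | genotype) is the product over reads of the per-read mixture.
        log_lik = sum(math.log(0.5 * rl[a] + 0.5 * rl[b]) for rl in read_likelihoods)
        prior = priors[genotype] if priors else 1.0 / len(genotypes)
        log_post[genotype] = log_lik + math.log(prior)

    # Normalise in a numerically stable way so that the posteriors sum to 1.
    peak = max(log_post.values())
    total = sum(math.exp(v - peak) for v in log_post.values())
    return {g: math.exp(v - peak) / total for g, v in log_post.items()}

# Five reads support the reference allele and five support the alternate allele.
pileup = [{"ref": 0.9, "alt": 0.001}] * 5 + [{"ref": 0.001, "alt": 0.9}] * 5
posteriors = genotype_posteriors(pileup, ["ref", "alt"])
best = max(posteriors, key=posteriors.get)
print(best)
```

With an even split of supporting reads, the heterozygous genotype receives almost all of the posterior mass, because either homozygous genotype would require half of the reads to be errors.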