Analysis Methods
DRAGEN for Illumina DNA Prep with Enrichment performs demultiplexing, FASTQ generation, read mapping, and alignment to a reference genome
• | FASTQ generation |
• | Germline FASTQ and VCF generation |
• | Somatic FASTQ and VCF generation |
FASTQ Generation
The assembled sequences are written to FASTQ files per sample. FASTQ files are text files that contain sequencing data and quality scores for only one sample. For each sample, separate FASTQ files are generated per flow cell lane, per sequencing read. The name of the sample as specified during run setup is included in the FASTQ file name. FASTQ files are the primary input for alignment. The first step of FASTQ generation is demultiplexing. Demultiplexing assigns clusters that pass filter to a sample by comparing each index read sequence to the index sequences specified for the run. No quality values are considered in this step. Index reads are identified using the following steps:
• | Samples are numbered starting from 1 based on the order they are listed for the run. |
• | Sample number 0 is reserved for clusters that were not assigned to a sample. |
• | Clusters are assigned to a sample when the index sequence matches exactly or when there is up to a single mismatch per index read. |
The software includes ORA compression to compress FASTQ files. When using the ORA (*.ora) format, the md5 checksum of the FASTQ content is preserved after a compression and decompression cycle to ensure a lossless compression.
DNA Mapping & Alignment
The first stage of mapping is to generate seeds from the read, and then look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference provides a bonus on the score, and a mismatch or indel imposes a penalty. The overall highest scoring path through the matrix is the alignment chosen.
The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. Any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors.Differently tuned alignment scoring values can be more appropriate for some applications.
DRAGEN Germline Variant Calling
The DRAGEN Germline Small Variant Caller takes mapped and aligned DNA reads as input and calls SNPs and indels through a combination of column-wise detection and local de novo assembly of haplotypes.
Callable reference regions are first identified with sufficient alignment coverage. Within these reference regions, a fast scan of the sorted reads identifies active regions, which are centered around pileup columns with evidence of a variant. The active regions are padded with enough context to cover significant, nonreference content nearby. If there is evidence of indels, the active regions receive additional padding.
Aligned reads are clipped within each active region and assembled into a De Bruijn graph. The edges of the clipped reads are weighted by observation counts, with the reference sequence as a backbone. After some graph cleanup and simplification, all source-to-sink paths are extracted as candidate haplotypes. Each haplotype is Smith-Waterman aligned to the reference genome to identify the variants it represents. This set of events may be augmented by a position-based detection. For each read-haplotype pair, the probability P(r|H) of observing the read, assuming the haplotype is the true starting sample, is estimated using a pair hidden Markov model (HMM).
Scanning by reference position over the active region, candidate genotypes are formed from diploid combinations of variant events (SNPs or indels). For each event (including reference), the conditional probability P(r|e) of observing each overlapping read is estimated as the maximum P(r|H) for haplotypes supporting the event. These are combined into the conditional probability P(r|e1e2) for a genotype (event pair) and multiplied to yield the conditional probability P(R|e1e2) of observing the whole read pileup. Using Bayes’ Formula, the posterior probability P(e1e2|R) of each diploid genotype is calculated, and the winner is called.
In the gVCF mode used for scalable multisample variant calling, the DRAGEN Germline Small Variant Caller can be run per-sample to generate an intermediate genomic variant call file (gVCF). The gVCF can then be used for efficient joint genotyping of multiple samples, which allows for the rapid incremental processing of samples and scaling to large cohort sizes.
Because the DRAGEN Germline Small Variant Caller has algorithms that make it able to efficiently distinguish correlated errors from true variants, the filtering rules are very simple.
DRAGEN Somatic Variant Calling
The DRAGEN Somatic Small Variant Caller takes mapped and aligned DNA reads as input and calls SNVs and indels through local de novo assembly of haplotypes in an active region.
Callable reference regions are first identified with sufficient alignment coverage. Within these reference regions, a scan of the sorted reads identifies active regions, which are centered around pileup columns with evidence of a variant in the tumor reads. The active regions are padded with enough context to cover significant, nonreference content nearby. If there is evidence of indels, the active regions receive additional padding.
Aligned reads are clipped within each active region and assembled into a De Bruijn graph. The edges of the clipped reads are weighted by observation counts, with the reference sequence as a backbone. After some graph cleanup and simplification, all source-to-sink paths are extracted as candidate haplotypes. Each haplotype is Smith-Waterman aligned to the reference genome to identify the variants it represents. For each read-haplotype pair, the probability P(r|H) of observing the read is estimated using a pair hidden Markov model (HMM) assuming the haplotype is the true starting sample.
To determine the TLOD score, DRAGEN Somatic Small Variant Caller first scans by reference position for each candidate somatic event as well as the reference event over the active region.The conditional probability P(r|e) of observing each overlapping read is estimated as the maximum P(r|H) for haplotypes supporting the event. These are combined into the conditional probability P(r|E) for an event hypothesis, E, involving a mixture of the reference and candidate somatic allele over a range of possible allele frequencies and multiplied to yield the conditional probability P(R|E) of observing the whole read pileup. From there, a TLOD score is calculated as the evidence that an ALT allele is present in the tumor sample at a given locus.