QC Metrics and Coverage/Callability Reports

DRAGEN generates pipeline-specific metrics coverage reports during each run. There are four different groups of metrics that are generated at different stages of the pipeline:

•

Mapping and Aligning metrics

•

VCF metrics

•

Duration (or run time) metrics

•

Coverage (or enrichment) metrics and reports

The mapping/aligning metrics, VCF metrics, Duration metrics, and a subset of available coverage reports are autogenerated and do not require any activation or specific commands. Additional coverage metrics can be enabled, and additional coverage regions can be specified.

DRAGEN performs metric calculation during analysis so that it does not impact the run time.

Generation of Metrics and Reports

QC Metrics Output Format

The QC metrics are printed to the standard output and CSV files are written to the run output directory.

•

<output prefix>.mapping_metrics.csv

•

<output prefix>.vc_metrics.csv

•

<output prefix>.<coverage region prefix>_coverage_metrics.csv

•

<output prefix>.time_metrics.csv

•

<output prefix>.<other coverage reports>.csv

Section	RG/Sample	Metric	Count/Ration/Time	Percentage/Seconds
MAPPING/ALIGNING SUMMARY		Total input reads	816360354
MAPPING/ALIGNING SUMMARY		Number of duplicate reads (marked not removed)	15779031	1.93
...
MAPPING/ALIGNING PER RG	RGID_1	Total reads in RG	816360354	100
MAPPING/ALIGNING PER RG	RGID_1	Number of duplicate reads (marked)	15779031	1.93
...
VARIANT CALLER SUMMARY		Number of samples	1
VARIANT CALLER SUMMARY		Reads Processed	738031938
...
VARIANT CALLER PREFILTER	SAMPLE_1	Total	4918287	100
VARIANT CALLER PREFILTER	SAMPLE_1	Biallelic	4856654	98.75
...
COVERAGE SUMMARY		Aligned bases in genome	14847587	100.00
COVERAGE SUMMARY		Average alignment coverage over genome	0.03
...
RUN TIME		Time loading reference	00:18.6	18.65
RUN TIME		Time aligning reads	19:24.4	1164.42

Mapping and Aligning Metrics

Mapping and aligning metrics, such as the metrics computed by the Samtools Flagstat command, are available on an aggregate level (over all input data), and on a per read group level. Unless explicitly stated, the metrics units are in reads (ie, not in terms of pairs or alignments).

Metric

Description

Total input reads

Total number of reads in the input FASTQ files.

Number of duplicate marked reads

Reads marked as duplicates as a result of the --enable-duplicate-marking option being set to true.

Number of duplicate marked and mate reads removed

Reads marked as duplicates, along with any mate reads that are removed when the --remove-duplicates option is set to true.

Number of unique reads

Total number of reads minus the duplicate marked reads.

Reads with mate sequenced

Number of reads with a mate.

Reads without mate sequenced

Total number of reads minus number of reads with mate sequenced.

QC-failed reads

Reads that did not pass platform/ vendor quality checks (SAM flag 0x200).

Mapped reads

Total number of mapped reads minus number of unmapped reads.

Number of unique and mapped reads

Number of mapped reads minus number of duplicate marked reads.

Unmapped reads

Total number of reads that could not be mapped.

Singleton reads

Number of reads where the read could be mapped, but the paired mate could not be read.

Paired reads

Count of reads in which both reads in the pair are mapped.

Properly paired reads

Both reads in the pair are mapped and fall within an acceptable range from each other based on the estimated insert length distribution.

Not properly paired reads (discordant)

The number of paired reads minus the number of properly paired reads.

Paired reads mapped to different chromosomes

The number of reads with a mate, where the mate was mapped to a different chromosome.

Paired reads mapped to different chromosomes (MAPQ >= 10)

The number of reads with a MAPQ > 10 and with a mate, where the mate was mapped to a different chromosome.

Reads with indel R1

The percentage of R1 reads containing at least 1 indel.

Reads with indel R2

The percentage of R2 reads containing at least 1 indel.

Soft-clipped bases R1

The percentage of bases in R1 reads that are soft-clipped.

Soft-clipped bases R2

The percentage of bases in R2 reads that are soft-clipped.

Mismatched bases R1

The number of mismatched bases on R1, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.

Mismatched bases R2

The number of mismatched bases on R2, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.

Mismatched bases R1 (excluding indels)

The number of mismatched bases on R1. The indels lengths are ignored. It does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.

Mismatched bases R2 (excluding indels)

The number of mismatched bases on R2. The indels lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.

Q30 Bases

The total number of bases with a BQ ≥ 30.

Q30 Bases R1

The total number of bases on R1 with a BQ ≥ 30.

Q30 Bases R2

The total number of bases on R2 with a BQ >= 30.

Q30 Bases (excluding dups and clipped bases)

The number of bases on nonduplicate and nonclipped bases with a BQ ≥ 30.

Histogram of reads map qualities

•

Reads with MAPQ [40:inf)

•

Reads with MAPQ [30:40)

•

Reads with MAPQ [20:30)

•

Reads with MAPQ [10:20)

•

Reads with MAPQ [0:10)

Total alignments

Total number of loci reads aligned to with > 0 quality.

Secondary alignments

Number of secondary alignment loci.

Supplementary (chimeric) alignments

A chimeric read is split over multiple loci (possibly due to structural variants). One alignment is referred to as the representative alignment. The other are supplementary.

Estimated read length

Total number of input bases divided by the number of reads.

Histogram

See Histogram Coverage Report.

PCT of bases aligned that fell inside the interval region

Number of bases inside the interval region and the target region divided by the total number of bases aligned.

Estimated sample contamination

The estimated fraction of reads in a sample that may be from another human source.

Cross-sample Contamination

The DRAGEN cross-sample contamination module uses a probabilistic mixture model to estimate the fraction of reads in a sample that may be from another human source. DRAGEN supports separate modes for germline and somatic samples.

The germline model, like VerifyBamID, assumes that a sample can be modeled as a DNA mixture from 2 or more individuals. Pileup analysis investigates loci where variants are common in the general population. Variants with high allele frequencies are likely to be real germline variants in the individual of interest, while low allele frequency variants are likely noise or germline variants from a contaminating sample B. The probabilistic mixture model accounts for noise and tries to detect consistent allele frequency distributions.

For example, if the pileups show consistent low allele frequencies of 1% or 2%, then the mixture model will likely infer 2% contamination from sample B, where the 1% and 2% AF variants correspond to heterozygous and homozygous germline calls in sample B.

The germline cross-contamination metric is enabled by using the following setting and pointing a VCF that includes marker sites (RSIDs) with population allele frequencies that are close to 0.5.

--qc-cross-cont-vcf /opt/edico/config/sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf

The metric value is printed as a fraction, so a value of 0.011 represents 1.1% contamination from another sample.

MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011

Variant calling precision, especially somatic variant calling precision, can be significantly affected by cross-sample contamination. To safely use a sample, the cross-sample contamination level should be significantly lower than the minimum allele frequencies of interest. For example, if a sample has 1% contamination, then it may be pertinent to ignore all variants with less than 5% AF. The sample cross-contamination metric saturates close to 30% contamination.

Variant Calling Metrics

The generated variant calling metrics are similar to the metrics computed by RTG vcfstats. Metrics are reported for each sample in multi sample VCF and gVCF files. Based on the run case, metrics are reported either as standard VARIANT CALLER or JOINT CALLER. Metrics are reported both for the raw (PREFILTER) and hard filtered (POSTFILTER) VCF file.

Panel of Normals (PON) and COSMIC filtered variants are counted as PASS variants in the POSTFILTER VCF metrics. These PASS variants can cause higher than expected variant counts in the POSTFILTER VCF metrics.

Metric	Description
Number of samples	Number of samples in the population/ joint VCF.
Reads Processed	The number of reads used for variant calling, excluding any duplicate marked reads and reads falling outside of the target region.
Total	The total number of variants (SNPs + MNPs + indels).
Biallelic	Number of sites in a genome that contains two observed alleles. The reference is counted as one allele, which allows for one variant allele.
Multiallelic	Number of sites in the VCF that contain three or more observed alleles. The reference is counted as one, which allows for two or more variant alleles.
SNPs	A variant is counted as an SNP when the reference, allele 1, and allele 2 are all length 1.
Insertions (Hom)	Number of variants that contains homozygous insertions.
Insertions (Het)	Number of variants where both alleles are insertions, but not homozygous.
Deletions (Het)	Number of variants that contains homozygous deletions.
INDELS (Het)	Number of variants where genotypes are either [insertion+deletion], [insertion+SNP], or [deletion+SNP].
De Novo SNPs	De novo marked SNPs with DQ > 0.05. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05.
De Novo INDELs	De novo marked indels with DQ values > 0.02. This DQ threshold can be specified by setting the --qc-indel-denovo-quality-threshold option to the required DQ threshold. The default is 0.02.
De Novo MNPs	De novo marked SNPs with DQ > 0.05. Set the --qc-snp-denovo-quality-threshold to the required threshold. The default is 0.05.
(Chr X SNPs)/(Chr Y SNPs) ratio in the genome (or the target region)	Number of SNPs in chromosome X (or in the intersection of chromosome X with the target region) divided by the number of SNPs in chromosome Y (or in the intersection of chromosome Y with the target region). If there was no alignment to either chromosome X or chromosome Y, this metric shows as NA.
SNP Transitions	An interchange of two purines (A<->G) or two pyrimidines (C<->T).
SNP Transversions	An interchange of purine and pyrimidine bases Ti/Tv ratio: ratio of transitions to transitions.
Heterozygous	Number of heterozygous variants.
Homozygous	Number of homozygous variants.
Het/Hom ratio	Heterozygous/ homozygous ratio.
In dbSNP	Number of variants detected that are present in the dbSNP reference file. If no dbSNP file is provided via the --bsnp option, then both the In dbSNP and Novel metrics show as NA.
Novel	Total number of variants minus number of variants in dbSNP.
Percent Callability	Available in germline and somatic modes with gVCF output. The percentage of non-N reference positions having a PASSing genotype call. Multiallelic variants are not counted. Deletions are counted for all the deleted reference positions only for homozygous calls. Only autosomes and chromosomes X, Y, and M are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names. Optionally, --qc-callability-xym-contigs allows setting X, Y and M contig names.
Percent Autosome Callability	Only autosomes are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names.
Percent QC Region Callability in Region i (i is equivalent to regions 1, 2, or 3)	Available if callability for custom regions is requested via the --qc-coverage-region-i option and the callability output is specified with --qc-coverage-reports-i. All contigs are considered. Setting --qc-callability-autosome-contigs enables outputting this metric for non-human references.

Per Contig Het/Hom Ratio

When the germline small variant caller is executed, DRAGEN calculates a per het/hom ratio per contig. DRAGEN reports the ratios for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF. The metrics are output to the .vc_hethom_ratio_metrics.csv file. The file contains the following values for each primary contig processed.

•

Contig

•

Number of heterozygous variants

•

Number of homozygous variants

•

Het/Hom ratio

The following example shows a section of the metrics.

VARIANT CALLER POSTFILTER,HG04070,1 Heterozygous,185733

VARIANT CALLER POSTFILTER,HG04070,1 Homozygous,182928

VARIANT CALLER POSTFILTER,HG04070,1 Het/Hom ratio,1.015

VARIANT CALLER POSTFILTER,HG04070,2 Heterozygous,203946

VARIANT CALLER POSTFILTER,HG04070,2 Homozygous,174294

VARIANT CALLER POSTFILTER,HG04070,2 Het/Hom ratio,1.170

VARIANT CALLER POSTFILTER,HG04070,3 Heterozygous,192861

VARIANT CALLER POSTFILTER,HG04070,3 Homozygous,130087

VARIANT CALLER POSTFILTER,HG04070,3 Het/Hom ratio,1.483

VARIANT CALLER POSTFILTER,HG04070,4 Heterozygous,178389

VARIANT CALLER POSTFILTER,HG04070,4 Homozygous,157062

VARIANT CALLER POSTFILTER,HG04070,4 Het/Hom ratio,1.136

You can use the het/hom ratio values as an indication of whole chromosome uniparental disomy (UPD). UPD of certain chromosomes are associated with genetic syndromes known as imprinting disorders. Whole chromosome UPD have het/hom ratios close to 0.0. Ranges vary, but are usually between 1.0–2.0. Make sure you are interpreting het/hom ratios in the context of your own sequencing data.

Duration Metrics

The duration metrics section includes a breakdown of the run duration for each process. For example, the following metrics are generated for the mapper and variant caller pipeline:

•

Time loading reference

•

Time aligning reads

•

Time sorting and marking duplicates

•

Time DRAGStr calibration

•

Time partial reconfiguration

•

Time variant calling

•

Total run time