QC Metrics and Coverage/Callability Reports

DRAGEN generates pipeline-specific metrics and coverage reports during each run. Four groups of metrics are generated at different stages of the pipeline:

• Mapping and Aligning metrics
• VCF metrics
• Duration (or run time) metrics
• Coverage (or enrichment) metrics and reports

The mapping/aligning metrics, VCF metrics, duration metrics, and a subset of the available coverage reports are generated automatically and do not require any activation or specific commands. Additional coverage metrics can be enabled, and additional coverage regions can be specified.

DRAGEN performs metric calculation during analysis so that it does not impact the run time.

Generation of Metrics and Reports

QC Metrics Output Format

The QC metrics are printed to standard output, and the following CSV files are written to the run output directory.

• <output prefix>.mapping_metrics.csv
• <output prefix>.vc_metrics.csv
• <output prefix>.time_metrics.csv
• <output prefix>.<coverage region prefix>_coverage_metrics.csv
• <output prefix>.<other coverage reports>.csv

Section	RG/Sample	Metric	Count/Ratio/Time	Percentage/Seconds
MAPPING/ALIGNING SUMMARY		Total input reads	816360354
MAPPING/ALIGNING SUMMARY		Number of duplicate reads (marked not removed)	15779031	1.93
...
MAPPING/ALIGNING PER RG	RGID_1	Total reads in RG	816360354	100
MAPPING/ALIGNING PER RG	RGID_1	Number of duplicate reads (marked)	15779031	1.93
...
VARIANT CALLER SUMMARY		Number of samples	1
VARIANT CALLER SUMMARY		Reads Processed	738031938
...
VARIANT CALLER PREFILTER	SAMPLE_1	Total	4918287	100
VARIANT CALLER PREFILTER	SAMPLE_1	Biallelic	4856654	98.75
...
RUN TIME		Time loading reference	00:18.6	18.65
RUN TIME		Time aligning reads	19:24.4	1164.42
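The five columns shown above can be loaded into a lookup table with a few lines of Python. The following is a minimal sketch, not part of DRAGEN: it assumes the CSV files are comma-separated, have no header row, and may omit the fifth column. The function name load_metrics is hypothetical.

```python
import csv
import io


def load_metrics(handle):
    """Return {(section, rg_or_sample, metric): (value, extra)} from a metrics CSV.

    Assumption: rows are comma-separated with up to five fields and no header row.
    """
    metrics = {}
    for row in csv.reader(handle):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        row = row + [""] * (5 - len(row))  # pad the optional fifth column
        section, group, metric, value, extra = row[:5]
        metrics[(section, group, metric)] = (value, extra)
    return metrics


# Rows taken from the sample output above, rewritten with commas.
sample = io.StringIO(
    "MAPPING/ALIGNING SUMMARY,,Total input reads,816360354\n"
    "MAPPING/ALIGNING SUMMARY,,Number of duplicate reads (marked not removed),15779031,1.93\n"
    "RUN TIME,,Time aligning reads,19:24.4,1164.42\n"
)
metrics = load_metrics(sample)
total_reads = int(metrics[("MAPPING/ALIGNING SUMMARY", "", "Total input reads")][0])
```

For a file on disk, pass a handle opened with open(path, newline="") instead of the in-memory sample. Values are kept as strings because the fourth column mixes counts, ratios, and times.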

Mapping and Aligning Metrics

Mapping and aligning metrics, similar to the metrics computed by the samtools flagstat command, are available on an aggregate level (over all input data) and on a per read group level. Unless explicitly stated otherwise, the metrics are in units of reads (ie, not in terms of pairs or alignments).

Metric	Description
Total input reads	Total number of reads in the input FASTQ files.
Number of duplicate marked reads	Reads marked as duplicates as a result of the --enable-duplicate-marking option being set to true.
Number of duplicate marked and mate reads removed	Reads marked as duplicates, along with any mate reads, that are removed when the --remove-duplicates option is set to true.
Number of unique reads	Total number of reads minus the duplicate marked reads.
Reads with mate sequenced	Number of reads with a mate.
Reads without mate sequenced	Total number of reads minus number of reads with mate sequenced.
QC-failed reads	Reads that did not pass platform/vendor quality checks (SAM flag 0x200).
Mapped reads	Total number of reads minus number of unmapped reads.
Number of unique and mapped reads	Number of mapped reads minus number of duplicate marked reads.
Unmapped reads	Total number of reads that could not be mapped.
Singleton reads	Number of reads where the read could be mapped, but the paired mate could not be mapped.
Paired reads	Count of reads in which both reads in the pair are mapped.
Properly paired reads	Count of reads in which both reads in the pair are mapped and fall within an acceptable range from each other based on the estimated insert length distribution.
Not properly paired reads (discordant)	The number of paired reads minus the number of properly paired reads.
Paired reads mapped to different chromosomes	The number of reads with a mate, where the mate was mapped to a different chromosome.
Paired reads mapped to different chromosomes (MAPQ >= 10)	The number of reads with a MAPQ >= 10 and with a mate, where the mate was mapped to a different chromosome.
Reads with indel R1	The percentage of R1 reads containing at least 1 indel.
Reads with indel R2	The percentage of R2 reads containing at least 1 indel.
Soft-clipped bases R1	The percentage of bases in R1 reads that are soft-clipped.
Soft-clipped bases R2	The percentage of bases in R2 reads that are soft-clipped.
Mismatched bases R1	The number of mismatched bases on R1, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2	The number of mismatched bases on R2, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R1 (excluding indels)	The number of mismatched bases on R1. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2 (excluding indels)	The number of mismatched bases on R2. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Q30 Bases	The total number of bases with a BQ ≥ 30.
Q30 Bases R1	The total number of bases on R1 with a BQ ≥ 30.
Q30 Bases R2	The total number of bases on R2 with a BQ ≥ 30.
Q30 Bases (excluding dups and clipped bases)	The number of nonclipped bases on nonduplicate reads with a BQ ≥ 30.
Histogram of reads map qualities	Reported as the metrics Reads with MAPQ [40:inf), Reads with MAPQ [30:40), Reads with MAPQ [20:30), Reads with MAPQ [10:20), and Reads with MAPQ [0:10).
Total alignments	Total number of loci that reads aligned to with quality > 0.
Secondary alignments	Number of secondary alignment loci.
Supplementary (chimeric) alignments	A chimeric read is split over multiple loci (possibly due to structural variants). One alignment is referred to as the representative alignment. The others are supplementary.
Estimated read length	Total number of input bases divided by the number of reads.
Histogram	See Histogram Coverage Report.
PCT of bases aligned that fell inside the interval region	Number of bases inside the interval region and the target region divided by the total number of bases aligned.
Estimated sample contamination	The estimated fraction of reads in a sample that may be from another human source.

The prediction accuracy of variant calling is affected by cross-sample contamination. Even small levels of contamination can lead to many false positive (FP) calls, especially in pipelines where the aim is to detect variants with low allele frequencies.

The DRAGEN cross-sample contamination module uses a probabilistic mixture model to estimate the fraction of reads in a sample that may be from another human source. This sample contamination fraction is estimated as the parameter value in the mixture model that maximizes the likelihood of the observed reads at multiple pile up locations. The mixture model accounts for the population allele frequencies and the inferred sample genotypes.

To enable this metric in germline mode, you must provide the file path on the command line to a VCF that includes marker sites (RSIDs) with population allele frequencies.

--qc-cross-cont-vcf /opt/edico/config/sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf

In somatic mode, the contamination algorithm first tries to avoid biases that could be introduced by CNV or LoH. The algorithm also estimates nucleotide noise from the sample to adjust for FFPE samples.

To enable somatic contamination detection, use the following setting.

--qc-somatic-contam-vcf /opt/edico/config/somatic_sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf.gz

The VCF resource files that are included with DRAGEN can be reconstructed from the Ensembl database. The VCF files included in the DRAGEN config folder contain ~5000 marker locations where the population AFs are close to 0.5. The files are reference-specific (hg19/GRCh37/GRCh38). DRAGEN aborts if the resource file and the reference are incompatible (eg, GRCh37 resource file and hg19 reference).

The following shows example output for a sample with 1.1% estimated contamination. This value is provided as a fraction, so a value of 0.011 is the same as 1.1%.

MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011
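Several mapping metrics are defined arithmetically in terms of others, so they can be cross-checked after a run. The following Python sketch uses the counts from the sample output earlier in this section. It assumes that the percentage column is computed against total input reads, which is consistent with the sample values but is not stated explicitly. The helper name percent is hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Counts from the sample output earlier in this section.
total_reads = 816360354
duplicate_reads = 15779031

# "Number of unique reads" = total reads minus duplicate marked reads.
unique_reads = total_reads - duplicate_reads

# Percentage reported next to the duplicate count
# (assumption: the denominator is total input reads).
duplicate_pct = percent(duplicate_reads, total_reads)

# Estimated sample contamination is reported as a fraction, not a percentage.
contamination_fraction = 0.011
contamination_pct = round(contamination_fraction * 100, 2)

print(unique_reads, duplicate_pct, contamination_pct)
```

The same pattern applies to the other derived rows, for example mapped reads (total reads minus unmapped reads) and not properly paired reads (paired reads minus properly paired reads).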

Variant Calling Metrics

The generated variant calling metrics are similar to the metrics computed by RTG vcfstats. Metrics are reported for each sample in multisample VCF and gVCF files. Depending on the run case, metrics are reported either as standard VARIANT CALLER or JOINT CALLER. Metrics are reported for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF file.

Panel of Normals (PON) and COSMIC filtered variants are counted as PASS variants in the POSTFILTER VCF metrics. These PASS variants can cause higher than expected variant counts in the POSTFILTER VCF metrics.

Metric	Description
Number of samples	Number of samples in the population/ joint VCF.
Reads Processed	The number of reads used for variant calling, excluding any duplicate marked reads and reads falling outside of the target region.
Total	The total number of variants (SNPs + MNPs + indels).
Biallelic	Number of sites in a genome that contain two observed alleles. The reference is counted as one allele, which allows for one variant allele.
Multiallelic	Number of sites in the VCF that contain three or more observed alleles. The reference is counted as one, which allows for two or more variant alleles.
SNPs	A variant is counted as an SNP when the reference, allele 1, and allele 2 are all length 1.
Insertions (Hom)	Number of variants that contain homozygous insertions.
Insertions (Het)	Number of variants where both alleles are insertions, but not homozygous.
Deletions (Hom)	Number of variants that contain homozygous deletions.
INDELS (Het)	Number of variants where genotypes are either [insertion+deletion], [insertion+SNP], or [deletion+SNP].
De Novo SNPs	De novo marked SNPs with DQ > 0.05. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05.
De Novo INDELs	De novo marked indels with DQ values > 0.02. This DQ threshold can be specified by setting the --qc-indel-denovo-quality-threshold option to the required DQ threshold. The default is 0.02.
De Novo MNPs	De novo marked MNPs with DQ > 0.05. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05.
(Chr X SNPs)/(Chr Y SNPs) ratio in the genome (or the target region)	Number of SNPs in chromosome X (or in the intersection of chromosome X with the target region) divided by the number of SNPs in chromosome Y (or in the intersection of chromosome Y with the target region). If there was no alignment to either chromosome X or chromosome Y, this metric shows as NA.
SNP Transitions	An interchange of two purines (A<->G) or two pyrimidines (C<->T).
SNP Transversions	An interchange of purine and pyrimidine bases.
Ti/Tv ratio	Ratio of transitions to transversions.
Heterozygous	Number of heterozygous variants.
Homozygous	Number of homozygous variants.
Het/Hom ratio	Heterozygous/ homozygous ratio.
In dbSNP	Number of variants detected that are present in the dbSNP reference file. If no dbSNP file is provided via the --dbsnp option, then both the In dbSNP and Novel metrics show as NA.
Novel	Total number of variants minus number of variants in dbSNP.
Percent Callability	Available in germline and somatic modes with gVCF output. The percentage of non-N reference positions having a PASSing genotype call. Multiallelic variants are not counted. Deletions are counted for all the deleted reference positions only for homozygous calls. Only autosomes and chromosomes X, Y, and M are considered.
Percent Autosome Callability	Only autosomes are considered.
Percent QC Region Callability in Region i (i is equivalent to regions 1, 2, or 3)	Available if callability for custom regions is requested via the --qc-coverage-region-i option and the callability output is specified with --qc-coverage-reports-i. All contigs are considered.
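The ratio and difference metrics in the table above can be recomputed from the underlying counts. The following Python sketch illustrates the arithmetic only. The counts are hypothetical, the function names ratio and novel_count are not DRAGEN names, and returning "NA" for a zero denominator is an assumption made for this example.

```python
def ratio(numerator, denominator):
    """Return numerator/denominator rounded to 3 decimals, or "NA" if undefined."""
    if not denominator:
        return "NA"  # assumption: an undefined ratio is shown as NA
    return round(numerator / denominator, 3)


def novel_count(total_variants, in_dbsnp):
    """Novel = total variants minus variants in dbSNP; "NA" without a dbSNP file."""
    if in_dbsnp is None:
        return "NA"
    return total_variants - in_dbsnp


# Hypothetical counts, chosen only to illustrate the arithmetic.
transitions, transversions = 2000, 1000
heterozygous, homozygous = 3000, 2000

ti_tv = ratio(transitions, transversions)    # Ti/Tv ratio
het_hom = ratio(heterozygous, homozygous)    # Het/Hom ratio
novel = novel_count(5000, 4900)              # dbSNP file provided
novel_without_dbsnp = novel_count(5000, None)  # no dbSNP file provided
```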

Per Contig Het/Hom Ratio

When the germline small variant caller is executed, DRAGEN calculates a het/hom ratio for each contig. DRAGEN reports the ratios for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF. The metrics are output to the .vc_hethom_metrics.csv file. The file contains the following values for each primary contig processed.

• Contig
• Number of heterozygous variants
• Number of homozygous variants
• Het/Hom ratio

The following example shows a section of the metrics.

VARIANT CALLER POSTFILTER,HG04070,1 Heterozygous,185733
VARIANT CALLER POSTFILTER,HG04070,1 Homozygous,182928
VARIANT CALLER POSTFILTER,HG04070,1 Het/Hom ratio,1.015
VARIANT CALLER POSTFILTER,HG04070,2 Heterozygous,203946
VARIANT CALLER POSTFILTER,HG04070,2 Homozygous,174294
VARIANT CALLER POSTFILTER,HG04070,2 Het/Hom ratio,1.170
VARIANT CALLER POSTFILTER,HG04070,3 Heterozygous,192861
VARIANT CALLER POSTFILTER,HG04070,3 Homozygous,130087
VARIANT CALLER POSTFILTER,HG04070,3 Het/Hom ratio,1.483
VARIANT CALLER POSTFILTER,HG04070,4 Heterozygous,178389
VARIANT CALLER POSTFILTER,HG04070,4 Homozygous,157062
VARIANT CALLER POSTFILTER,HG04070,4 Het/Hom ratio,1.136

You can use the het/hom ratio values as an indication of whole chromosome uniparental disomy (UPD). UPD of certain chromosomes is associated with genetic syndromes known as imprinting disorders. Chromosomes with whole chromosome UPD have het/hom ratios close to 0.0. Ratios for other chromosomes vary, but are usually between 1.0 and 2.0. Make sure you are interpreting het/hom ratios in the context of your own sequencing data.
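Screening the per contig file for ratios close to 0.0 can be scripted. The following Python sketch parses rows in the format of the example above and lists contigs whose ratio falls below a cutoff. The function names and the 0.1 cutoff are hypothetical and are not DRAGEN defaults. Choose a cutoff based on your own sequencing data.

```python
import csv
import io


def het_hom_by_contig(handle, section="VARIANT CALLER POSTFILTER"):
    """Return {contig: het/hom ratio} for one section of the het/hom metrics file."""
    ratios = {}
    for row in csv.reader(handle):
        if len(row) != 4 or row[0] != section:
            continue
        # The third field holds "<contig> <metric name>", for example "1 Het/Hom ratio".
        contig, _, metric = row[2].partition(" ")
        if metric == "Het/Hom ratio":
            ratios[contig] = float(row[3])
    return ratios


def low_ratio_contigs(ratios, threshold=0.1):
    """Return contigs whose ratio is below the (hypothetical) threshold."""
    return sorted(contig for contig, value in ratios.items() if value < threshold)


# Rows copied from the example above.
sample = io.StringIO(
    "VARIANT CALLER POSTFILTER,HG04070,1 Heterozygous,185733\n"
    "VARIANT CALLER POSTFILTER,HG04070,1 Homozygous,182928\n"
    "VARIANT CALLER POSTFILTER,HG04070,1 Het/Hom ratio,1.015\n"
    "VARIANT CALLER POSTFILTER,HG04070,3 Het/Hom ratio,1.483\n"
)
ratios = het_hom_by_contig(sample)
candidates = low_ratio_contigs(ratios)
```

A flagged contig is only a prompt for follow-up, not a diagnosis.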

Duration Metrics

The duration metrics section includes a breakdown of the run duration for each process. For example, the following metrics are generated for the mapper and variant caller pipeline:

• Time loading reference
• Time aligning reads
• Time sorting and marking duplicates
• Time DRAGStr calibration
• Time partial reconfiguration
• Time variant calling
• Total run time
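In the sample output at the start of this chapter, each RUN TIME row carries the duration twice: as a clock-style value (for example, 19:24.4) and in seconds (1164.42). If only the clock-style value is available, it can be converted as in the following Python sketch. The sketch assumes that fields are colon-separated and ordered from most to least significant. The function name duration_to_seconds is hypothetical.

```python
def duration_to_seconds(text):
    """Convert a colon-separated duration such as "19:24.4" to seconds.

    Assumption: fields run from most to least significant
    (minutes:seconds or hours:minutes:seconds), each worth 60 of the next.
    """
    seconds = 0.0
    for field in text.split(":"):
        seconds = seconds * 60 + float(field)
    return seconds


aligning = duration_to_seconds("19:24.4")  # 19 minutes plus 24.4 seconds
loading = duration_to_seconds("00:18.6")
```

The clock-style value is rounded to one decimal place in the sample, so the converted result can differ slightly from the seconds column (1164.4 versus 1164.42).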