DRAGEN-ML

Setup

Additional setup is not required. ML model files for the hg38 and hg19 human references are packaged with the DRAGEN installer.

• After the installation, the files will be present in the /opt/edico/resources/ml_model/<ref> folder.

• DRAGEN-ML is enabled as needed when running the germline SNV VC.

• DRAGEN detects the reference used for analysis and uses the matching model files. If neither an hg38 nor an hg19 reference type is detected, ML recalibration is automatically disabled and the SNV VC falls back to legacy operation.
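As a pre-flight check, the install location described above can be inspected before a run. The following is a minimal sketch, not part of DRAGEN: the helper names are hypothetical, and the value substituted for <ref> is left to the caller because the text does not list the actual folder names. DRAGEN performs its own reference detection regardless of this check.

```python
from pathlib import Path

# Install location quoted in the text; the <ref> component is supplied by the caller.
ML_MODEL_ROOT = Path("/opt/edico/resources/ml_model")


def ml_model_dir(ref_name: str, root: Path = ML_MODEL_ROOT) -> Path:
    """Return the folder where model files for ref_name would be expected."""
    return root / ref_name


def has_ml_model(ref_name: str, root: Path = ML_MODEL_ROOT) -> bool:
    """True if the model folder exists and contains at least one entry.

    Hypothetical helper: it only checks the file system and says nothing
    about whether DRAGEN will detect the reference type.
    """
    folder = ml_model_dir(ref_name, root)
    return folder.is_dir() and any(folder.iterdir())
```

A wrapper script could call has_ml_model() with the folder name used on the local system and log a warning when it returns False, since in that case the run would be expected to fall back to legacy operation.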

Command-line Options

Example DRAGEN command-line options:

--vc-ml-dir=/path/to/package/directory --vc-ml-enable-recalibration=true

Where /path/to/package/directory contains the extracted support files from the DRAGEN-ML package.
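When DRAGEN is launched from a wrapper script, the two options above can be appended to an existing argument list. The sketch below is illustrative only: build_ml_args and build_command are hypothetical helper names, the base argument list (inputs, reference, output options) is assumed to be supplied by the caller, and passing false to disable recalibration is an assumption based on the option taking a boolean value.

```python
import shlex
from typing import List


def build_ml_args(ml_dir: str, enable: bool = True) -> List[str]:
    """Return the two DRAGEN-ML options shown in the text."""
    flag = "true" if enable else "false"  # "false" is an assumed value
    return [
        f"--vc-ml-dir={ml_dir}",
        f"--vc-ml-enable-recalibration={flag}",
    ]


def build_command(base_args: List[str], ml_dir: str, enable: bool = True) -> List[str]:
    """Append the DRAGEN-ML options to a caller-supplied argument list."""
    return [*base_args, *build_ml_args(ml_dir, enable)]


# Only the executable name is shown here; all other options are omitted.
command = build_command(["dragen"], "/path/to/package/directory")
print(shlex.join(command))
```

Keeping the command as a list of arguments, rather than a single string, lets it be passed directly to subprocess.run without shell quoting issues.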

Inputs

Because the machine learning model extracts information from the read pile-up, DRAGEN-ML requires a run with BAM or FASTQ input. DRAGEN-ML runs concurrently with the DRAGEN SNV VC and can be applied to WGS or WES samples. Recalibration of existing VCF files is not supported.

Outputs

DRAGEN-ML recalibrates all quality scores, changing the values of the QUAL and GQ fields in the output VCF/GVCF.

• DRAGEN-ML also updates PL and GP in the output VCF/GVCF.

• The genotypes (GT field) of some variants may be changed by ML, for example from 0/1 to 1/1 or vice versa.

• DRAGEN-ML Phred scores are limited to a maximum value of around 60-70. Therefore, the QUAL filtering threshold is set to 3 when DRAGEN-ML is enabled, compared to 10 for the DRAGEN SNV VC when DRAGEN-ML is disabled.
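The thresholds above are easier to interpret through the standard Phred relation Q = -10 * log10(P), where P is the estimated probability that a call is wrong. The sketch below converts between the two and applies the mode-dependent threshold. The function names are hypothetical, and treating a QUAL exactly equal to the threshold as passing is an assumption, since the text does not state whether the boundary is inclusive.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


def qual_threshold(ml_enabled: bool) -> int:
    """QUAL filtering threshold quoted in the text: 3 with ML, 10 without."""
    return 3 if ml_enabled else 10


def passes_qual(qual: float, ml_enabled: bool) -> bool:
    """Assumed inclusive comparison against the mode-dependent threshold."""
    return qual >= qual_threshold(ml_enabled)
```

Under this relation, a QUAL of 10 corresponds to an error probability of 0.1, a QUAL of 3 to roughly 0.5, and a QUAL of 60 to one in a million. A call with QUAL 5 would therefore pass with ML enabled but not with ML disabled.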

The following variant types are recalibrated:

• Biallelic and multiallelic variants

• Autosomes and sex chromosomes, including haploid positions

• Force GT calls

• Non-primary contigs

Accuracy Improvements

DRAGEN-ML typically removes 30-50% of SNP false positives (FPs), with smaller gains on indels. False negative (FN) counts are reduced by 10% or more. The output QUAL/GQ of DRAGEN-ML is empirically better calibrated than that of the DRAGEN SNV VC without ML. With ML enabled, there are significant gains in accuracy statistics across the entire genome. Note that a small number of variant calls may have degraded accuracy with ML enabled compared to VC without ML.
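Figures such as these are usually derived from true positive, false positive, and false negative counts produced by benchmarking a call set against a truth set, once with ML and once without. The sketch below shows the arithmetic only. The function names are hypothetical, and no particular benchmarking tool is implied by the text.

```python
def relative_reduction(before: int, after: int) -> float:
    """Fraction by which a count dropped, e.g. 0.4 means 40% fewer errors."""
    if before <= 0:
        raise ValueError("'before' must be a positive count")
    return (before - after) / before


def precision(tp: int, fp: int) -> float:
    """Fraction of reported calls that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of true variants that were called."""
    return tp / (tp + fn)
```

Applying relative_reduction to the FP counts of the two runs gives the kind of percentage quoted above for SNPs. Applying it to the FN counts gives the corresponding FN reduction.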

Run time

DRAGEN-ML adds about 10% to the run time compared to runs without ML.