Targeted Variant Calling
Repetitive regions in the human genome pose a challenge for general variant calling approaches, which typically cannot make use of misplaced MAPQ0 reads. Furthermore, the high sequence homology of some genes with a pseudogene paralog can lead to a wide variety of common structural variants (SVs) in the population, which requires specialized targeted calling approaches. DRAGEN supports targeted calling for a number of genes/targets, as described in the subsequent target-specific sections.
Each targeted caller can be enabled/disabled separately (refer to target-specific section), or all targeted callers can be enabled using the command line option --enable-targeted=true. The targeted callers produce a <prefix>.targeted.json containing a gene-level summary of the variant caller results. Additionally, the details of individual variant calls are reported in a <prefix>.targeted.vcf.gz.

All targeted callers require WGS data aligned to a human reference genome with at least 30x coverage. Human reference genome builds based on hg19, GRCh37, and hg38 are supported.


The targeted callers generate a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON formatted file containing the fields below.
Fields in JSON | Explanation | Type and Possible Values | Present |
---|---|---|---|
sample | The sample name. | string | always |
dragenVersion | The version of DRAGEN. | string | always |
pharmcatMetabolismStatusResourceUrl | Web URL for the PharmCAT resource used for calling metabolism status. | string | PGx gene target is enabled |
cyp2b6 | The CYP2B6 targeted caller specific fields. | dictionary | CYP2B6 caller is enabled |
cyp2d6 | The CYP2D6 targeted caller specific fields. | dictionary | CYP2D6 caller is enabled |
cyp21a2 | The CYP21A2 targeted caller specific fields. | dictionary | CYP21A2 caller is enabled |
gba | The GBA targeted caller specific fields. | dictionary | GBA caller is enabled |
hba | The HBA targeted caller specific fields. | dictionary | HBA caller is enabled |
lpa | The LPA targeted caller specific fields. | dictionary | LPA caller is enabled |
rh | The RH targeted caller specific fields. | dictionary | RH caller is enabled |
smn | The SMN targeted caller specific fields. | dictionary | SMN caller is enabled |
star_allele | The Star Allele specific fields. | dictionary | Star Allele is enabled |
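
As a minimal sketch of how this file might be consumed, the following Python snippet loads a targeted JSON file and lists which caller-specific dictionaries are present. The key names come from the table above; the function name `enabled_callers` and the file name in the usage note are hypothetical, and nothing is assumed about the contents of each caller dictionary.

```python
import json

# Caller-specific top-level keys, taken from the table above.
CALLER_KEYS = (
    "cyp2b6", "cyp2d6", "cyp21a2", "gba", "hba",
    "lpa", "rh", "smn", "star_allele",
)


def enabled_callers(path):
    """Return the caller keys that are present in a targeted JSON file."""
    with open(path) as handle:
        summary = json.load(handle)
    # A caller key is only present when that caller was enabled.
    return [key for key in CALLER_KEYS if key in summary]
```

For example, `enabled_callers("sample.targeted.json")` would return the subset of caller keys found in that (hypothetical) file, in the order of the table.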

The targeted callers generate a <output-file-prefix>.targeted.vcf[.gz] file in the output directory. The output file is in VCFv4.2 format and may be compressed. The targeted callers that support VCF output are: RH, HBA, LPA, and SMN.
Small variant calls and copy number variant calls are reported in the same VCF file.
The <output-file-prefix>.targeted.vcf[.gz] file includes a source header line that identifies DRAGEN_TARGETED as the source of the VCF calls.
##source=DRAGEN_TARGETED
The main INFO fields that are used in the <output-file-prefix>.targeted.vcf[.gz] file are the EVENT and EVENTTYPE INFO fields.
The EVENT and EVENTTYPE INFO fields are formally introduced in VCFv4.4 to enable the representation of complex rearrangements. This is achieved using the EVENT field to group all the related VCF records together, and the EVENTTYPE to classify the event. The corresponding header lines are the following.
##INFO=<ID=EVENT,Number=A,Type=String,Description="Event name">
##INFO=<ID=EVENTTYPE,Number=A,Type=String,Description="Type of associated event">
However, the use of EVENT is not limited to complex rearrangements. It can also be used to associate non-symbolic alleles, for example in cases of variant position ambiguity in high homology regions.
Since the EVENTTYPE values are implementation-defined, custom EVENTTYPE header lines are included to describe each EVENTTYPE.
##EVENTTYPE=<ID=GENE_CONVERSION,Description="Gene conversion event">
##EVENTTYPE=<ID=VARIANT_IN_HOMOLOGY_REGION,Description="Variant in homology region">
##EVENTTYPE=<ID=VNTR,Description="Variable number tandem repeat">

In the case of target variants in a high homology region, the variant is reported in all homologous regions with a ploidy corresponding to the total copy number of all the homologous regions.
In the depicted example there are two genes, A and B, that include a high homology region. The usual process to call variants in these regions is to make a joint pileup of the reads aligning to both genes A and B and call the variants using a model with a ploidy proportional to the total copy number of the regions. This generates divergent possible genotypes that are equally likely, since the variant cannot be confidently placed in either gene A or gene B. The variant is reported as follows:
chr1 100 . A T . TargetedRepeatConflict EVENT=GeneA-B:50A>T;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION GT 0/0/0/1
chr1 200 . A T . TargetedRepeatConflict EVENT=GeneA-B:50A>T;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION GT 0/0/0/1
Given the unconventional ploidy representation of variants in high homology regions, a TargetedRepeatConflict filter is applied to the records. The header line for the filter is the following.
##FILTER=<ID=TargetedRepeatConflict,Description="Set if call conflicts with a targeted call">
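
To illustrate how the EVENT field ties these records together, the sketch below groups single-sample VCF data lines by their EVENT value and derives the ploidy from the GT field. It is a hedged example, not part of DRAGEN: the helper names `parse_info` and `group_by_event` are hypothetical, and the code assumes whitespace-separated columns with no whitespace inside any field.

```python
import re
from collections import defaultdict


def parse_info(info):
    """Parse a VCF INFO string into a dict; flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def group_by_event(lines):
    """Group single-sample VCF data lines by their EVENT INFO value."""
    events = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        chrom, pos, filt = cols[0], int(cols[1]), cols[6]
        info = parse_info(cols[7])
        sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
        # Ploidy is the number of alleles in GT, split on '/' or '|'.
        ploidy = len(re.split(r"[/|]", sample["GT"]))
        events[info.get("EVENT")].append((chrom, pos, filt, ploidy))
    return dict(events)
```

Applied to the two records above, both positions fall under the single event `GeneA-B:50A>T`, each with a ploidy of 4.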

When a gene conversion is identified, the differentiating sites in the acceptor region are reported as variants.
In the depicted example there are two genes, A and B, and gene A is the acceptor of a gene conversion from gene B (green box in the figure). Gene conversions are identified by observing variations in copy number at differentiating sites (blue and pink bars in the figure) in consecutive regions. Copy number variations between regions define the breakends of the gene conversion. There are two possible ways to report the gene conversion:
1. Report the differentiating site variants in the acceptor gene as SNV entries.
2. Report a variation in copy number and the gene conversion breakends as CNV and SV entries.
Only the small variant representation is currently supported, and is as follows.
chr1 121 . A T . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
...
chr1 280 . G A . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
In the case of an observed gene conversion event, there may be differentiating sites with a genotype that is inconsistent with that gene conversion event. In these cases the RecombinantConflict filter is applied. The RecombinantConflict is defined by the following header line.
##FILTER=<ID=RecombinantConflict,Description="Set if call has a copy number that conflicts with a recombinant variant">
In the example, the resulting representation is as follows:
chr1 121 . A T . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
...
chr1 144 . C T . RecombinantConflict EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 1|1:121
chr1 153 . A G . RecombinantConflict EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT 0/0
...
chr1 280 . G A . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
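
As a rough sketch of how a downstream script might separate the sites that support a gene conversion from those flagged as inconsistent with it, the function below splits the records of one event by FILTER status. The function name `split_conversion_sites` is hypothetical, and the code assumes whitespace-separated columns; records carrying other filters are ignored.

```python
def split_conversion_sites(lines, event):
    """Split the sites of one gene conversion event by FILTER status."""
    supporting, conflicting = [], []
    for line in lines:
        cols = line.split()
        # Skip headers, elided rows ("..."), and incomplete records.
        if len(cols) < 8 or line.startswith("#"):
            continue
        info = cols[7].split(";")
        if f"EVENT={event}" not in info:
            continue
        if "EVENTTYPE=GENE_CONVERSION" not in info:
            continue
        filters = cols[6].split(";")
        if "RecombinantConflict" in filters:
            conflicting.append(int(cols[1]))
        elif filters == ["PASS"]:
            supporting.append(int(cols[1]))
    return supporting, conflicting
```

For the records shown in the example, calling the function with the event name `GC_AB` places positions 121 and 280 in the supporting list and positions 144 and 153 in the conflicting list.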

The use of GT=0 for symbolic structural variant alleles is formally disambiguated in VCFv4.4, which specifies that "GT=0 indicates the absence of any of the ALT symbolic structural variants defined in the record".
In the depicted example there are two regions, one with a heterozygous DUP followed by a REF copy number region. The CNV representation is as follows:
chr1 100 . A <DUP> . . END=300;EVENT=A;SVCLAIM=D;SVLEN=200;CN=2 GT:CN 0/1:3
chr1 300 . A <CNV> . . END=500;EVENT=A;SVCLAIM=D;SVLEN=200; GT:CN 0/0:2
The relevant header lines for the VCF records above are the following:
##INFO=<ID=CN,Number=A,Type=Float,Description="Copy number of CNV / breakpoint">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively.">
##FORMAT=<ID=CN,Number=1,Type=Float,Description="Estimated copy number">

In the depicted example there are two overlapping CNVs that can be represented as follows:
chr1 100 . A <DUP> . . END=300;EVENT=A;SVCLAIM=D;SVLEN=200;CN=2 GT:PS .|1:100
chr1 200 . G <DEL> . . END=500;EVENT=A;SVCLAIM=D;SVLEN=300;CN=0 GT:PS 1|.:100

In the depicted example there is a Variable Number Tandem Repeat (VNTR) region composed of three repeat units in the reference. The CN INFO field is used to report the allele copy number. The CN FORMAT field is used to report the total copy number of the region, given by the sum of the allele copy numbers. The REPCN FORMAT field is used to report the repeat unit copy number, equal to the allele copy number multiplied by the number of repeat units in the reference.
This VNTR can be represented as follows:
chr1 100 . A <DUP>,<DUP> . . END=400;EVENT=A;EVENTTYPE=VNTR;SVCLAIM=D;SVLEN=300;CN=2.6,4.3 GT:CN:REPCN 1|2:6.9:8|13
The REPCN header line is the following.
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="Number of repeat units spanned by the allele">
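
The arithmetic behind the CN and REPCN FORMAT values in the VNTR record can be sketched as follows. This is an illustration only: the function name `vntr_copy_numbers` is hypothetical, and the rounding (one decimal place for CN, nearest integer for REPCN) is an assumption inferred from the example record rather than a documented rule.

```python
def vntr_copy_numbers(allele_cns, ref_repeat_units):
    """Derive the CN and REPCN FORMAT values from per-allele copy numbers."""
    # CN FORMAT: total copy number of the region, i.e. the sum of the
    # allele copy numbers (rounding to one decimal is an assumption).
    total_cn = round(sum(allele_cns), 1)
    # REPCN FORMAT: allele copy number times the number of repeat units in
    # the reference, rounded to an integer (assumed from the example) and
    # joined with '|' in the same order as the alleles.
    repcn = "|".join(
        str(round(cn * ref_repeat_units)) for cn in allele_cns
    )
    return total_cn, repcn
```

With the allele copy numbers 2.6 and 4.3 from the record above and three repeat units in the reference, `vntr_copy_numbers([2.6, 4.3], 3)` gives a total CN of 6.9 and a REPCN of `8|13`, matching the example.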

The TargetedLowQual filter is applied if the QUAL of a targeted variant is less than 3.00.
The TargetedLowGQ filter is applied if the targeted variant has GQ smaller than 3.
##FILTER=<ID=TargetedLowGQ,Description="Set if call has GQ < 3">
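
The two thresholds above can be expressed as a small check. This is a sketch under stated assumptions, not DRAGEN code: the function name `targeted_quality_filters` is hypothetical, missing QUAL or GQ values (passed as `None`) are assumed not to trigger a filter, and only these two quality filters are considered.

```python
QUAL_THRESHOLD = 3.00  # TargetedLowQual applies when QUAL is below this
GQ_THRESHOLD = 3       # TargetedLowGQ applies when GQ is below this


def targeted_quality_filters(qual, gq):
    """Return the quality filters implied by the thresholds, or ["PASS"]."""
    filters = []
    # Missing values (None) are assumed not to trigger a filter.
    if qual is not None and qual < QUAL_THRESHOLD:
        filters.append("TargetedLowQual")
    if gq is not None and gq < GQ_THRESHOLD:
        filters.append("TargetedLowGQ")
    return filters or ["PASS"]
```

Both comparisons are strict, so a call with QUAL exactly 3.00 or GQ exactly 3 is not filtered.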

The target-specific small variant VCF calls can be merged into the <prefix>.hard-filtered.vcf.gz and <prefix>.hard-filtered.gvcf.gz files (referred to below as the hard-filtered files) by passing the caller name(s) to the --targeted-merge-vc option. For example, --targeted-merge-vc rh merges the calls from the rh caller into the hard-filtered files, and --targeted-merge-vc rh hba merges the calls from the rh and hba callers. The value all merges the calls from all supported targeted callers into the hard-filtered files, while the value none merges no calls.
The default value for --targeted-merge-vc is rh, therefore the calls from the rh caller are merged into the hard-filtered files by default.
The targeted calls merged into the hard-filtered files are marked with a TARGETED INFO flag.
When enabled, targeted small variants are merged into the hard-filtered files regardless of any regions that may be provided using the --vc-target-bed option.

The merging strategy for targeted small variant calls is to prioritize the targeted calls over small variant calls from the germline small variant caller. When a germline small variant call overlaps a targeted caller call, then the small variant call is filtered with a TargetedConflict filter if any of the following holds:
• The targeted caller call is PASS.
• The small variant call and targeted caller call have incompatible genotypes and the targeted caller call is not filtered with the TargetedLowGQ filter.
The strategy is summarized in the following examples.
1. The TARGETED call is PASS
chr1 100 . A C . TargetedConflict . GT 0/1
chr1 100 . A C . PASS TARGETED GT 1/1
2. The TARGETED call and the small variant call are not overlapping
chr1 110 . T TCA . PASS . GT 0/1
chr1 111 . G A . PASS TARGETED GT 0/1
3. The TARGETED call is filtered with TargetedLowQual and has a discordant variant representation with the overlapping small variant call
chr1 120 . ATTC A . TargetedConflict . GT 0/1
chr1 121 . T A . TargetedLowQual TARGETED GT 0/1
chr1 125 . TCAC T . TargetedLowQual TARGETED GT 0/1
chr1 126 . C G . TargetedConflict . GT 0/1
4. The TARGETED call is filtered with TargetedLowQual and has a discordant genotype with the overlapping small variant call
chr1 130 . C G . TargetedConflict . GT 0/1
chr1 130 . C G . TargetedLowQual TARGETED GT 1/1
5. The TARGETED call is filtered with TargetedLowGQ and has a discordant genotype with the overlapping small variant call
chr1 140 . AC A . PASS . GT:GQ 0/1:5
chr1 140 . A T . TargetedLowGQ TARGETED GT:GQ 1/1:2
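
The merging rule and the five examples can be condensed into a short decision function. This is a hedged sketch of the logic as described in this section, not the DRAGEN implementation: the function name `small_variant_is_conflicted` and its parameters are hypothetical, and the caller is assumed to have already determined whether the two calls overlap and whether their genotypes (or variant representations) are compatible.

```python
def small_variant_is_conflicted(overlaps, targeted_filters,
                                genotypes_compatible):
    """Decide whether a germline small variant call gets TargetedConflict.

    overlaps: True if the small variant call overlaps the targeted call.
    targeted_filters: FILTER values of the targeted call, e.g. ["PASS"].
    genotypes_compatible: True if the two calls agree.
    """
    if not overlaps:
        return False
    # Rule 1: a PASS targeted call always takes priority.
    if targeted_filters == ["PASS"]:
        return True
    # Rule 2: incompatible calls conflict unless the targeted call is
    # filtered with TargetedLowGQ.
    return (not genotypes_compatible
            and "TargetedLowGQ" not in targeted_filters)
```

Under these assumptions the function returns `True` for examples 1, 3, and 4, and `False` for examples 2 and 5, which matches the FILTER values shown for the small variant calls.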