Star Allele Caller
The Star Allele Caller identifies the genotypes and metabolism status of the following PGx genes, which are included in the FDA's PGx recommendations or have a CPIC Level A designation: CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, BCHE, ABCG2, NAT2, F5, and UGT2B17. The Star Allele Caller finds optimal genotypes for these genes based on star allele definitions from the resources described below. It calls metabolism status based on a PharmCAT resource file that maps genotypes to phenotypes. The Star Allele Caller primarily supports the human reference hg38, for which it supports all of the genes listed above. In addition, it supports the following genes on the hg19 and GRCh37 references: CACNA1S, CYP2C19, CYP2C9, CYP3A5, CYP4F2, IFNL3, NUDT15, SLCO1B1, VKORC1, DPYD, ABCG2, and F5.
For the genes CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, and ABCG2, the allele definitions are sourced from PharmGKB. Because BCHE does not have defined star alleles, the Star Allele Caller instead checks whether a sample is positive for any of the reported variants.
On hg19 and GRCh37, the definitions for the genes CYP2C19, CYP2C9, CYP3A5, CYP4F2, NUDT15, SLCO1B1, and DPYD are sourced from PharmVar. For the remaining hg19/GRCh37 genes (ABCG2, CACNA1S, IFNL3, F5, and VKORC1), the allele definitions have been lifted over from the corresponding hg38 definitions, which are sourced from PharmGKB.
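The gene lists above differ by reference, so a pipeline may want to check support before interpreting a missing call. The following is a minimal sketch that only transcribes the lists from this section; the names `SUPPORTED_GENES` and `is_supported` are hypothetical and are not part of DRAGEN.

```python
# Gene lists transcribed from this section; not read from DRAGEN itself.
HG38_GENES = frozenset({
    "CACNA1S", "CFTR", "CYP2C19", "CYP2C9", "CYP3A5", "CYP4F2", "IFNL3",
    "RYR1", "NUDT15", "SLCO1B1", "TPMT", "UGT1A1", "VKORC1", "DPYD", "G6PD",
    "MT-RNR1", "BCHE", "ABCG2", "NAT2", "F5", "UGT2B17",
})
HG19_GRCH37_GENES = frozenset({
    "CACNA1S", "CYP2C19", "CYP2C9", "CYP3A5", "CYP4F2", "IFNL3", "NUDT15",
    "SLCO1B1", "VKORC1", "DPYD", "ABCG2", "F5",
})

# Hypothetical lookup table keyed by reference name.
SUPPORTED_GENES = {
    "hg38": HG38_GENES,
    "hg19": HG19_GRCH37_GENES,
    "GRCh37": HG19_GRCH37_GENES,
}


def is_supported(gene: str, reference: str) -> bool:
    """Return True if the gene is listed for the given reference."""
    return gene in SUPPORTED_GENES[reference]
```

For example, `is_supported("TPMT", "hg19")` is false because TPMT appears only in the hg38 list.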
The Star Allele Caller has the following features.
| • | It calls star allele genotypes from different types of genomic data, such as FASTQ, BAM/CRAM, gVCF, and VCF |
| • | It provides additional details about the genotype call, including a confidence score |
| • | It assumes the reference genotype at missing positions and lists these positions in the output |
| • | It assumes the reference genotype for filtered genotype calls and lists these records in the output |
| • | If more than one diplotype is optimal, it lists all of them |
| • | It supports different versions of the human reference hg38 |
| • | For the genes UGT2B17 and CYP2C19, it analyzes CNV calls to detect star alleles |
The Star Allele Caller accepts different forms of sequence data as input: FASTQ files, BAM/CRAM files, or gVCF/VCF files. In the simplest case, the caller takes a DRAGEN gVCF file as input. The following is an example of the command line for this use case.
dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/9 \
--star-allele-gvcf /staging/test/data/NA12878.gvcf \
--star-allele-cnv-vcf /staging/test/data/NA12878.cnv.vcf.gz \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true
Unlike a variant-only VCF file, a DRAGEN gVCF file contains the genotypes for all positions in a genome. Although gVCF is the preferred format for the caller, it can also accept a standard variant-only VCF file as input. The command line for that case is the same as above, with the VCF file passed instead of a gVCF file. The CNV VCF file is optional; if it is omitted, the Star Allele Caller does not call star alleles that are detected through CNV analysis.
The following is an example of this use case, with only a variant-only VCF file as input.
dragen \
-r /staging/human/reference/hg38_alt_aware+cnv+hla+rna_v2/DRAGEN/9 \
--star-allele-gvcf /staging/test/data/NA12878.vcf \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true
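When these runs are scripted, the two command lines above differ only in whether a CNV VCF is supplied. The sketch below assembles the argument list from Python using only the options shown in this section; the function name `star_allele_args` and its parameters are hypothetical, and it does not launch DRAGEN.

```python
from typing import List, Optional


def star_allele_args(ref_dir: str, small_variant_vcf: str, out_dir: str,
                     prefix: str, cnv_vcf: Optional[str] = None) -> List[str]:
    """Assemble the argument list for a run from an existing gVCF or VCF."""
    args = [
        "dragen",
        "-r", ref_dir,
        # The same option takes either a gVCF or a variant-only VCF.
        "--star-allele-gvcf", small_variant_vcf,
    ]
    if cnv_vcf is not None:
        # Optional: without it, CNV-based star alleles are not called.
        args += ["--star-allele-cnv-vcf", cnv_vcf]
    args += [
        "--output-directory", out_dir,
        "--output-file-prefix", prefix,
        "--enable-star-allele", "true",
    ]
    return args
```

The resulting list could be passed to `subprocess.run(args, check=True)` on a machine where `dragen` is installed.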
The small variant VCF/gVCF and CNV-VCF files should meet the following specifications.
| • | Must be aligned to the same human reference that is passed through the -r option. |
| • | Variants should follow a parsimonious, left-aligned variant representation. |
| • | Complex variants (for example, a single record that represents closely located, independent variants) are NOT supported. |
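To screen third-party VCF records against the representation requirement above, one option is the allele-only test commonly used for variant normalization: the alleles must end in different bases, and they must either start with different bases or the shorter allele must be a single base. The sketch below applies that test to one biallelic REF/ALT pair. It is an assumption that DRAGEN's requirement matches this definition exactly; the function name is hypothetical, and symbolic alleles such as <NON_REF> must be skipped by the caller.

```python
def is_normalized(ref: str, alt: str) -> bool:
    """Return True if a biallelic REF/ALT pair is parsimonious and left-aligned.

    Allele-only test: the two alleles end in different bases, and either
    start with different bases or the shorter allele is a single base.
    """
    if not ref or not alt or ref == alt:
        raise ValueError("REF and ALT must be non-empty and different")
    ends_differ = ref[-1] != alt[-1]
    starts_differ = ref[0] != alt[0]
    shortest_is_one = min(len(ref), len(alt)) == 1
    return ends_differ and (starts_differ or shortest_is_one)
```

Under this test, `CA` to `C` passes, while `CAA` to `CA` fails because both alleles end in the same base and can be trimmed further.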
VCF/gVCF files can be replaced with gzip-compressed files (i.e., <file_name>.vcf.gz or <file_name>.gvcf.gz). If a BAM or FASTQ file is passed as input, the preferred way to run the Star Allele Caller is with the --enable-pgx option, which turns on all the components that the Star Allele Caller needs. The human reference must be passed as a command-line option. The Star Allele Caller detects the reference version (i.e., hg19, GRCh37, or hg38) and reads in the corresponding allele definitions. The following is an example of the command line for BAM input.
dragen \
-r /staging/human/reference/hg38_alt_aware+cnv+hla+rna_v2/DRAGEN/9 \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align false \
--enable-pgx true
Note that the directory passed with the -r option is the directory containing the DRAGEN hash table for a reference, not the raw reference FASTA. The DRAGEN team can provide these hash tables, or they can be built with the following command (where hg38.fa is a reference FASTA).
dragen \
--build-hash-table true \
--ht-reference hg38.fa \
--output-directory <directory_for_ref_hashtables>
Once the hash tables are generated, the path <directory_for_ref_hashtables> can be passed with the -r parameter. If the BAM file is not aligned to the available hg38 reference, it can be remapped automatically by setting the option --enable-map-align true. To use a BAM file that is pre-aligned to a custom hg38 reference with the Star Allele Caller, build a DRAGEN reference hash table for that custom reference using the command above.
To pass FASTQ files as input, the additional options --RGID and --RGSM must be set on the command line. An example of the command line for this use case is as follows.
dragen \
-r /staging/human/reference/hg38_alt_aware+cnv+hla+rna_v2/DRAGEN/9 \
-1 /scratch/NA11829.fq1.gz \
-2 /scratch/NA11829.fq2.gz \
--RGID DRAGEN_RGID \
--RGSM DRAGEN_RGSM \
--enable-map-align true \
--output-directory /staging/test/output \
--output-file-prefix NA11829 \
--enable-pgx true
The --enable-pgx setting turns on other PGx callers, such as CYP2D6, CYP2B6, and HLA, in addition to the Star Allele Caller. Note that to run the HLA caller, the passed reference must contain anchored_hla, a subdirectory with HLA-specific reference files. To turn on ONLY the Star Allele Caller and not the other PGx callers, enable the variant caller in addition to the Star Allele Caller. Enabling the CNV caller is optional but recommended, because it is needed to analyze CNV star alleles. An example of the command line for this use case is as follows.
dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/9 \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align false \
--enable-star-allele true \
--enable-variant-caller true \
--vc-emit-ref-confidence gvcf \
--enable-cnv true \
--cnv-enable-self-normalization true
Additional variant caller options such as --vc-forcegt-vcf should not be used when the Star Allele Caller is enabled.
After the DRAGEN Star Allele Caller run completes, the following four output files are produced.
| 1. | When the Star Allele Caller is run along with other callers, the main output file, <prefix>.targeted.json, contains the complete and detailed results for all genes. (When the caller is run in stand-alone mode, this file is not produced; see item 2 for the relevant output in that case.) The following is an example of the output for one gene, DPYD, and one sample, NA19374. |
{
"dragenVersion": "4.2.0-724-gb600fcef",
"sample": "NA19374",
"pharmcatMetabolismStatusResourceUrl": "https://github.com/PharmGKB/PharmCAT/blob/aeecfe5f787e95dfb31ede62884e287affef45b3/src/main/resources/org/pharmgkb/pharmcat/definition/gene_phenotypes.json",
"star_allele": {
"calls": [
{
"gene": "DPYD",
"lastUpdate": "10/13/2021",
"alleleDefinitionsUrl": "https://www.pharmgkb.org/page/dpydRefMaterials",
"genotype": "./.",
"pharmcatDescription": null,
"pharmcatMetabolismStatus": null,
"variants": "chr1:97515839:T:C,<NON_REF>:0/1:49:70:PASS;chr1:97883329:A:G,<NON_REF>:1/1:65:68:PASS;chr1:97573881:C:T,<NON_REF>:0/1:50:73:PASS",
"variantStarAllelesFound": "c.1218G>A:c.1627A>G(*5):c.85T>C(*9A)",
"minGQ": "49",
"missingGenotypes": "",
"filteredGenotypes": ""
}
]
}
}
The fields in the JSON file are as follows.
| • | "dragenVersion": Version of DRAGEN that is being executed |
| • | "sample": Sample name |
| • | "pharmcatMetabolismStatusResourceUrl": Web URL for the PharmCAT resource used for calling metabolism status |
| • | "gene": Gene name |
| • | "lastUpdate": Last update for allele definitions for the gene |
| • | "alleleDefinitionsUrl": Web URL for the star allele definitions file |
| • | "genotype": Detected optimal genotype for the gene |
| • | "pharmcatDescription": A description of the called genotype from PharmCAT resource file |
| • | "pharmcatMetabolismStatus": Metabolism status for called genotype from PharmCAT resource file |
| • | "variants": List of relevant variants for the gene |
| • | "variantStarAllelesFound": List of satisfied star allele haplotypes corresponding to variants |
| • | "minGQ": A confidence score for the genotype call |
| • | "missingGenotypes": List of relevant positions for which GT is missing in gVCF input |
| • | "filteredGenotypes": List of relevant positions for which GT is filtered in gVCF input |
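The "variants" string packs several values into each record. The sketch below splits it under a layout inferred from the DPYD example only: records are separated by semicolons, and each record has eight colon-separated values read here as chromosome, position, REF, ALT list, GT, GQ, one value that this section does not name, and the filter status. The GQ position is an inference that agrees with the example, where the smallest value in that position (49) equals "minGQ". The function names are hypothetical, and the split would need adjusting for contig names that contain a colon.

```python
from typing import Dict, List, Optional


def parse_variants(field: str) -> List[Dict[str, str]]:
    """Split the "variants" string into one dict per record (inferred layout)."""
    records = []
    for entry in filter(None, field.split(";")):
        parts = entry.split(":")
        if len(parts) != 8:
            raise ValueError(f"unexpected record layout: {entry!r}")
        chrom, pos, ref, alts, gt, gq, unnamed, flt = parts
        records.append({
            "chrom": chrom, "pos": pos, "ref": ref, "alts": alts,
            "gt": gt, "gq": gq,
            "unnamed": unnamed,  # value not described in this section
            "filter": flt,
        })
    return records


def min_gq(records: List[Dict[str, str]]) -> Optional[int]:
    """Smallest GQ across the records, or None if there are no records."""
    if not records:
        return None
    return min(int(r["gq"]) for r in records)
```

Applied to the DPYD example above, `parse_variants` yields three records and `min_gq` returns 49, matching the reported "minGQ".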
The latest version of the resource file from PharmCAT no longer carries the pharmcatDescription field. As a result, the field is deprecated in DRAGEN v4.2.
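Because "pharmcatDescription" is deprecated and the status fields can be null, downstream code is safer if it treats them as optional. The following is a minimal sketch of a reader for an already parsed JSON document; it relies only on the keys shown in the example above, and the function name `summarize_calls` is hypothetical.

```python
import json
from typing import Dict, Optional, Tuple


def summarize_calls(document: dict) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each gene to (genotype, metabolism status)."""
    calls = document.get("star_allele", {}).get("calls", [])
    summary = {}
    for call in calls:
        # pharmcatDescription is deprecated, so it is deliberately not read.
        status = call.get("pharmcatMetabolismStatus")
        summary[call["gene"]] = (call["genotype"], status)
    return summary


# Abbreviated stand-in for the DPYD example; real files carry more fields.
example = json.loads("""
{"star_allele": {"calls": [
  {"gene": "DPYD", "genotype": "./.", "pharmcatDescription": null,
   "pharmcatMetabolismStatus": null}
]}}
""")
print(summarize_calls(example))  # {'DPYD': ('./.', None)}
```

For a real run, the document would come from `json.load` on the JSON output file rather than from an inline string.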
Each star allele genotype contains one or two haplotypes (a single haplotype for the chrM gene MT-RNR1 and for the chrX gene G6PD in male samples, and a diplotype for all other genes). The two haplotypes of a diplotype are separated by a slash (e.g., *1/*2). Each haplotype is a predefined star allele, and the definitions can be found under the allele definitions URL. Note that star allele definitions and notations can vary depending on the resource and when it was last updated. The Star Allele Caller follows the PharmGKB definitions and notations exactly. When the Star Allele Caller cannot identify an optimal genotype for a gene, it makes a no-call (./. or .). In certain cases, more than one genotype is optimal; all such genotypes are then listed, separated by colons (e.g., *1/*2:*3/*4).
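The genotype notation described above can be split with two passes: colons separate alternative genotypes, and slashes separate the haplotypes within one genotype. The sketch below does this under the assumption that allele names themselves contain neither a colon nor a slash; the function name `parse_genotype` is hypothetical.

```python
from typing import List, Optional


def parse_genotype(genotype: str) -> Optional[List[List[str]]]:
    """Return a list of candidate genotypes, each a list of haplotypes.

    Returns None for a no-call ("./." or ".").
    """
    if genotype in (".", "./."):
        return None
    # Colons separate equally optimal genotypes; slashes separate haplotypes.
    return [candidate.split("/") for candidate in genotype.split(":")]
```

For example, `parse_genotype("*1/*2:*3/*4")` gives `[["*1", "*2"], ["*3", "*4"]]`, and a single haplotype such as `*1` gives `[["*1"]]`.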
| 2. | TSV and JSON files are produced when the Star Allele Caller is run stand-alone from a gVCF or VCF file, or when the option --targeted-enable-legacy-output is set. The TSV file, <prefix>.star_allele.tsv, contains summarized star allele calls for each gene. The following is an example for one gene from the TSV output; the fields are gene name and genotype: UGT1A1 *36/*80+*37 |
In the same cases, a JSON file, <prefix>.star_allele.json, is also produced. It contains the genotype calls for just the Star Allele Caller (unlike the main output file, <prefix>.targeted.json, which aggregates calls from all targeted callers). The format of this JSON file is the same as the JSON format described above.
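The summarized TSV output can be read into a dictionary with a few lines of code. This sketch assumes exactly what is stated above, two tab-separated fields per line (gene name, then genotype); whether the file carries a header row is not stated here, so a header, if present, would appear as an ordinary entry. The function name is hypothetical.

```python
import io
from typing import Dict, TextIO


def read_star_allele_tsv(handle: TextIO) -> Dict[str, str]:
    """Read gene and genotype columns from an open TSV file."""
    calls = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene, genotype = fields[0], fields[1]
        calls[gene] = genotype
    return calls


# In-memory stand-in for a <prefix>.star_allele.tsv file.
sample = io.StringIO("UGT1A1\t*36/*80+*37\n")
print(read_star_allele_tsv(sample))  # {'UGT1A1': '*36/*80+*37'}
```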
| 3. | A gVCF file, <prefix>.select.gvcf, contains the specific gVCF records that were used to make the star allele genotype calls. This file follows the VCF specification and is a subset of the VCF/gVCF file that is supplied as input (or generated by the DRAGEN variant caller from the FASTQ/BAM/CRAM files). |
