Star Allele Caller

The star allele caller identifies the genotypes and metabolism status of the following 16 Tier-1 PGx genes: CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1. It finds the optimal genotype for each of these genes based on star allele definitions from PharmGKB, and it calls metabolism status based on a PharmCAT resource file that maps genotypes to phenotypes.

Functionality

The star allele caller has the following features.

• It calls star allele genotypes from different types of genomic data like FASTQ, BAM, gVCF, VCF

• It provides additional details about the genotype call, including a confidence score

• It assumes genotypes for missing positions to be ref - these positions are listed in the output

• It assumes filtered genotype calls to be ref - these records are also listed in the output

• If multiple optimal diplotypes are satisfied, then it lists them all

• It supports different versions of the human reference hg38

Input files and command line options

The star allele caller accepts different forms of sequence data as input: FASTQ files, BAM/CRAM files, or gVCF/VCF files. In the simplest case, the caller takes a DRAGEN gVCF file as input. The following is an example of the command line for this use case.

dragen \
--star-allele-gvcf /staging/test/data/NA12878.gvcf \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true

Unlike a variant-only VCF file, a DRAGEN gVCF file contains the genotypes for all positions in a genome. Although gVCF is the preferred format for the caller, it can also accept a standard variant-only VCF file as input. The command line for that case is the same as above, with the VCF file passed instead of a gVCF file:

dragen \
--star-allele-gvcf /staging/test/data/NA12878.vcf \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true

The VCF/gVCF file should meet the following specifications.

• Must be aligned to human reference hg38.

• Variants should follow a parsimonious left aligned variant representation format.

• Complex variants - for example, representing closely located, independent variants, in a single record - are NOT supported.

Note that a VCF/gVCF file can also be supplied as a gzip-compressed file (i.e., <file_name>.vcf.gz or <file_name>.gvcf.gz).

If a BAM file is passed as input, the variant caller must be enabled in addition to the star allele caller. The human reference must also be passed as a command line option. This must be an hg38-based reference, and the same reference that the BAM file is aligned to. An example of the command line for this use case is as follows.

dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/8 \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align false \
--enable-star-allele true \
--enable-variant-caller true \
--vc-emit-ref-confidence gvcf

Note that the -r option takes the directory containing the DRAGEN hash table for a reference, not the raw reference FASTA. The DRAGEN team can provide these hash tables, or they can be built using the following command (where hg38.fa is a reference FASTA).

dragen \
--build-hash-table true \
--ht-reference hg38.fa \
--output-directory <directory_for_ref_hashtables>

Once the hash tables are generated, the path <directory_for_ref_hashtables> can be passed with the -r parameter. If the BAM file is not aligned to the available hg38 reference, it can be remapped automatically by setting the option --enable-map-align true. To run the star allele caller on a BAM file that is already aligned to a custom hg38 reference, and to keep using that reference, build a DRAGEN reference hash table for it using the above command.

To pass FASTQ files as input, the additional options --RGID and --RGSM must be set in the command line. An example of the command line for this use case is as follows.

dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/8 \
-1 /scratch/NA11829.fq1.gz \
-2 /scratch/NA11829.fq2.gz \
--RGID DRAGEN_RGID \
--RGSM DRAGEN_RGSM \
--enable-map-align true \
--output-directory /staging/test/output \
--output-file-prefix NA11829 \
--enable-variant-caller true \
--enable-star-allele true \
--vc-emit-ref-confidence GVCF

Lastly, the star allele caller and the other PGx callers (i.e., the CYP2B6 and CYP2D6 callers) can be run together by passing the option --enable-pgx true. This option automatically turns on the different callers, but it can only be used with a BAM or FASTQ file as input. An example for this use case, using a BAM file, is as follows.

dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/8 \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align false \
--enable-pgx true \
--vc-emit-ref-confidence GVCF

Note that additional variant caller options such as --vc-forcegt-vcf should not be used when the star allele caller is enabled.
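The gVCF/VCF and BAM invocations above can also be assembled programmatically. The following is a minimal Python sketch, not part of DRAGEN: the function name build_star_allele_args and the choice to select the mode from the file extension are assumptions made for illustration, and only options that appear in the examples above are used.

```python
from pathlib import Path


def build_star_allele_args(input_path, output_dir, prefix, ref_dir=None):
    """Return a dragen argument list for the star allele caller.

    Hypothetical helper: the input mode is guessed from the file extension.
    """
    name = Path(input_path).name.lower()
    common = [
        "--output-directory", str(output_dir),
        "--output-file-prefix", prefix,
        "--enable-star-allele", "true",
    ]
    if name.endswith((".vcf", ".gvcf", ".vcf.gz", ".gvcf.gz")):
        # Mirrors the gVCF/VCF example, which passes no reference.
        return ["dragen", "--star-allele-gvcf", str(input_path)] + common
    if name.endswith(".bam"):
        if ref_dir is None:
            raise ValueError("BAM input requires an hg38 hash table directory")
        # Mirrors the BAM example: variant caller on, map/align off.
        return [
            "dragen",
            "-r", str(ref_dir),
            "--bam-input", str(input_path),
            "--enable-map-align", "false",
            "--enable-variant-caller", "true",
            "--vc-emit-ref-confidence", "GVCF",
        ] + common
    raise ValueError(f"unsupported input type: {input_path}")


args = build_star_allele_args(
    "/staging/test/data/NA12878.gvcf", "/staging/test/output", "NA12878_dragen"
)
print(" ".join(args))
```

The returned list can be passed to subprocess.run(args, check=True) on a machine where dragen is installed. Returning a list rather than a single string avoids shell quoting issues with paths.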

Output files

After the DRAGEN star allele caller run completes, the following four output files are produced.

The main output file, <prefix>.star_allele.json, contains the complete and detailed results for all 16 genes. The following is an example of the output for one gene (DPYD) and one sample (HG01085).

{
  "Samplename": "HG01085",
  "PharmCAT metabolism status resource URL": "https://github.com/PharmGKB/PharmCAT/blob/development/src/main/resources/org/pharmgkb/pharmcat/definition/gene_phenotypes.json",
  "Genotype calls": [
    {
      "Gene": "DPYD",
      "Last update": "10/13/2021",
      "Allele definitions URL": "https://www.pharmgkb.org/page/dpydRefMaterials",
      "Genotype": "c.85T>C(*9A)/Reference",
      "PharmCAT description": "An individual carrying two normal function alleles",
      "PharmCAT metabolism status": "Normal Metabolizer",
      "Variants": "chr1:97883329:A:G,<NON_REF>:0/1:48:32:PASS",
      "Variant star alleles found": "c.85T>C(*9A)",
      "minGQ": "43",
      "Missing genotypes": "",
      "Filtered genotypes": ""
    }
  ]
}
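A report with this structure can be read with any JSON parser. The following Python sketch assumes only the keys visible in the example above; the function name summarize_calls and the output keys (including the assumed_ref flag, which marks calls that relied on missing or filtered positions) are hypothetical, and minGQ is assumed to be a numeric string as in the example.

```python
import json


def summarize_calls(report):
    """Yield one summary dict per gene from a parsed star_allele.json report."""
    for call in report.get("Genotype calls", []):
        min_gq = str(call.get("minGQ", ""))
        yield {
            "sample": report.get("Samplename"),
            "gene": call.get("Gene"),
            "genotype": call.get("Genotype"),
            "status": call.get("PharmCAT metabolism status"),
            # Assumption: minGQ is a non-negative integer stored as a string.
            "min_gq": int(min_gq) if min_gq.isdigit() else None,
            # True when the call relied on positions assumed to be ref.
            "assumed_ref": bool(
                call.get("Missing genotypes") or call.get("Filtered genotypes")
            ),
        }


# Trimmed copy of the example report above.
example = json.loads("""
{
  "Samplename": "HG01085",
  "Genotype calls": [
    {
      "Gene": "DPYD",
      "Genotype": "c.85T>C(*9A)/Reference",
      "PharmCAT metabolism status": "Normal Metabolizer",
      "minGQ": "43",
      "Missing genotypes": "",
      "Filtered genotypes": ""
    }
  ]
}
""")

for row in summarize_calls(example):
    print(row)
```

For a real run, replace the inline example with json.load() on the <prefix>.star_allele.json file from the output directory.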

The fields in the JSON file are as follows.

• "Samplename": Sample name

• "PharmCAT metabolism status resource URL": Web URL for the PharmCAT resource used for calling metabolism status

• "Gene": Gene name

• "Last update": Last update for allele definitions for the gene

• "Allele definitions URL": Web URL for the star allele definitions file

• "Genotype": Detected optimal genotype for the gene

• "PharmCAT description": A description of the called genotype from PharmCAT resource file

• "PharmCAT metabolism status": Metabolism status for called genotype from PharmCAT resource file

• "Variants": List of relevant variants for the gene

• "Variant star alleles found": List of satisfied star allele haplotypes corresponding to variants

• "minGQ": A confidence score for the genotype call

• "Missing genotypes": List of relevant positions for which GT is missing in gVCF input

• "Filtered genotypes": List of relevant positions for which GT is filtered in gVCF input

Each star allele genotype consists of one or two haplotypes: a single haplotype for the chrM gene MT-RNR1 and, in male samples, for the chrX gene G6PD, and a diplotype, with the two haplotypes separated by a slash (e.g., *1/*2), for all other genes. Each haplotype is a predefined star allele, and the definitions can be found under the allele definitions URL. Note that star allele definitions and notations may vary depending on the resource and when it was last updated. The star allele caller follows the PharmGKB definitions and notations exactly. When the star allele caller cannot identify an optimal genotype for a gene, a no-call (./. or .) is made. In certain cases more than one genotype is optimal; all such genotypes are then listed, separated by a colon (e.g., *1/*2:*3/*4).
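The genotype notation described above can be split into its parts with a few lines of Python. This is a minimal sketch: the function name parse_genotype is hypothetical, and it assumes that star allele names themselves never contain a colon or a slash, which holds for the examples in this section but is not guaranteed by it.

```python
def parse_genotype(genotype):
    """Split a genotype string into alternative calls.

    Returns a list with one entry per colon-separated alternative. Each entry
    is a list of haplotype names, or None for a no-call ("./." or ".").
    """
    calls = []
    for alternative in genotype.split(":"):
        haplotypes = alternative.split("/")
        if all(h == "." for h in haplotypes):
            calls.append(None)  # no-call
        else:
            calls.append(haplotypes)
    return calls


print(parse_genotype("*1/*2:*3/*4"))  # [['*1', '*2'], ['*3', '*4']]
print(parse_genotype("./."))          # [None]
```

A result with more than one entry means that several genotypes were equally optimal, so downstream code should not silently pick the first one.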

A second output file, <prefix>.star_allele.tsv, contains summarized star allele calls for each gene. The fields are gene name and genotype. The following is an example of one gene from the TSV output.

UGT1A1 *36/*80+*37
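The summary file can be loaded into a gene-to-genotype mapping. The sketch below assumes that the two fields are tab-separated and that the file has no header row; neither detail is stated here, so check both against a real output file. The function name read_star_allele_tsv is hypothetical.

```python
import csv
import io


def read_star_allele_tsv(handle):
    """Map gene name to genotype string from a star_allele.tsv file handle."""
    calls = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        calls[row[0]] = row[1]
    return calls


# Inline stand-in for open("<prefix>.star_allele.tsv", newline="").
example = io.StringIO("UGT1A1\t*36/*80+*37\n")
print(read_star_allele_tsv(example))
```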

A gVCF file, <prefix>.select.gvcf, contains the specific gVCF records that were used to make the star allele genotype calls. This file follows the specifications of a VCF file and is a subset of the VCF/gVCF file that is supplied as input (or generated by the DRAGEN variant caller from the FASTQ/BAM/CRAM files).

A file, <prefix>.starAlleleDefinitions.vcf, contains the allele definitions for each gene and their corresponding genotype records from the sample. An excerpt from this file is as follows.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG01085
chr1    97883368        rs150036960     G       C       .       .        Gene=DPYD;PgxRefAllele=Reference;AltAllele=c.46C>G     CHROM:POS:REF:ALT:GT:DP:GQ:FILTER       chr1:97883367:A:<NON_REF>:0/0:81:32:PASS
chr1    97883353        rs72549310      G       A       .       .        Gene=DPYD;PgxRefAllele=Reference;AltAllele=c.61C>T     CHROM:POS:REF:ALT:GT:DP:GQ:FILTER       chr1:97883351:T:<NON_REF>:0/0:81:28:PASS
chr1    97883352        rs80081766      C       T       .       .        Gene=DPYD;PgxRefAllele=Reference;AltAllele=c.62G>A     CHROM:POS:REF:ALT:GT:DP:GQ:FILTER       chr1:97883351:T:<NON_REF>:0/0:81:28:PASS

Each line in this file defines a variant for a star allele of a gene. The INFO field identifies the gene, the name of the reference star allele for the gene (PgxRefAllele), and the name of the star allele being defined (AltAllele) that includes this variant. The last column lists the gVCF record of the sample that corresponds to the variant. Note that the gVCF file may summarize multiple homozygous-reference records into a single line, so the position or allele of the record may not exactly match that of the variant.

Note: The format of this file is VCF-like, i.e., the columns follow the VCF format. However, the genomic positions follow the order of the star allele definitions (i.e., the records are not sorted by position), and certain variants are recorded multiple times if they are included in multiple star allele definitions.
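Because the file is only VCF-like, a standard VCF parser may reject it or reorder it, and a small hand-written reader can be simpler. The sketch below is based solely on the excerpt above: it assumes whitespace-separated columns, INFO entries of the form key=value, and a sample column whose colon-separated values line up with the keys in the FORMAT column. The function name parse_definition_line is hypothetical.

```python
def parse_definition_line(line):
    """Parse one non-header line of a starAlleleDefinitions.vcf file."""
    fields = line.split()
    chrom, pos, variant_id, ref, alt = fields[:5]
    # INFO column, e.g. Gene=DPYD;PgxRefAllele=Reference;AltAllele=c.46C>G
    info = dict(item.split("=", 1) for item in fields[7].split(";"))
    # FORMAT column names the parts of the sample's gVCF record.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
        "info": info,
        "sample": sample,
    }


# First data line of the excerpt above, rebuilt with tab separators.
example_line = "\t".join([
    "chr1", "97883368", "rs150036960", "G", "C", ".", ".",
    "Gene=DPYD;PgxRefAllele=Reference;AltAllele=c.46C>G",
    "CHROM:POS:REF:ALT:GT:DP:GQ:FILTER",
    "chr1:97883367:A:<NON_REF>:0/0:81:32:PASS",
])

record = parse_definition_line(example_line)
print(record["info"]["AltAllele"], record["sample"]["GT"], record["sample"]["GQ"])
```

When reading a whole file, skip lines that start with "#" and keep the records in file order, since the same variant can appear under several star allele definitions.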