CYP2B6 Caller

The first step of CYP2B6 calling is to determine the combined copy number of CYP2B6 and CYP2B7. Reads aligned to regions in either CYP2B6 or CYP2B7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2B6 and CYP2B7 copy number is then calculated from the average sequencing depth across the CYP2B6 and CYP2B7 regions.
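
The normalization step can be illustrated with a minimal sketch. It assumes the depths are already GC-corrected, that each region depth pools reads aligned to either gene, and that the diploid baseline is the median depth of the normalization regions. The function name, the use of a median, and the depth values are illustrative assumptions, not the DRAGEN implementation.

```python
from statistics import median


def combined_copy_number(region_depths, normalization_depths):
    """Estimate the combined CYP2B6 + CYP2B7 copy number.

    region_depths: GC-corrected depths of the CYP2B6/CYP2B7 regions, where
        each value pools reads aligned to either gene (assumption).
    normalization_depths: GC-corrected depths of the normalization regions,
        taken here to represent two copies (the diploid baseline).
    """
    if not region_depths or not normalization_depths:
        raise ValueError("both depth lists must be non-empty")
    # Assumption: the median of the normalization regions is the baseline.
    diploid_baseline = median(normalization_depths)
    mean_depth = sum(region_depths) / len(region_depths)
    # Scale so that a depth equal to the baseline corresponds to 2 copies.
    return 2.0 * mean_depth / diploid_baseline


# Hypothetical depths: pooled depth is twice the diploid baseline.
estimate = combined_copy_number([60.0, 62.0, 58.0], [30.0, 29.0, 31.0, 30.0])
print(round(estimate))
```

With these hypothetical values the estimate is 4, which is the expected combined copy number for a sample carrying two copies of each gene.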

Differentiating Sites

The CYP2B6-derived copy number is calculated at 99 predefined differentiating sites across the CYP2B6 gene. The differentiating sites are positions where the CYP2B6 and CYP2B7 sequences differ and where calling the CYP2B6-derived copy number is more than 98% accurate, based on sequencing data from the 1000 Genomes Project.

For each differentiating site, CYP2B6-specific and CYP2B7-specific alleles are counted in reads mapping to either CYP2B6 or the homologous region in CYP2B7. The CYP2B6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2B6 and CYP2B7 copy number calculated from the previous step.
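
As a rough illustration of this calculation, the sketch below assigns the total copy number to CYP2B6 in proportion to the fraction of reads carrying the CYP2B6-specific allele. The proportional estimate and the function name are assumptions made for illustration; the source does not state the exact formula used.

```python
def cyp2b6_copies_at_site(n_2b6_reads, n_2b7_reads, total_copy_number):
    """Return (raw, rounded) CYP2B6-derived copy number at one site.

    Assumption: copies are apportioned by the fraction of reads that
    carry the CYP2B6-specific allele at the differentiating site.
    """
    total_reads = n_2b6_reads + n_2b7_reads
    if total_reads == 0:
        raise ValueError("no reads cover this differentiating site")
    raw = total_copy_number * n_2b6_reads / total_reads
    return raw, round(raw)


# Hypothetical counts: 45 CYP2B6-specific and 15 CYP2B7-specific reads,
# with a combined copy number of 4 from the previous step.
print(cyp2b6_copies_at_site(45, 15, 4))
```

In this hypothetical case three of the four copies are attributed to CYP2B6, which is the pattern expected at sites covered by a CYP2B6 duplication or a fusion.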

Structural Variant Calling

The CYP2B6-derived copy number along the CYP2B6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:

Fusion Breakpoint	Hybrid Gene Structure	Star-Allele Designation
intron 4-exon 5	2B7-2B6	*29
intron 4-exon 5	2B6-2B7	*30

Small Variant Calling

35 small variants that define various star alleles are detected from the read alignments. All of these variants are in unique (nonhomologous) regions of CYP2B6 with high mapping quality. Only reads mapping to CYP2B6 are used for calling variants in nonhomologous regions.

For each variant, reads containing either the variant allele or the nonvariant allele are counted. A binomial model that incorporates sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant).

Samples with poor sequencing quality or greater than five copies of CYP2B6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.
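
The following sketch shows one way a binomial model of this kind could work. It scores each candidate variant copy number by the binomial log-likelihood of the observed allele counts and reports alternates whose likelihood is close to the best one. The error rate, the likelihood-ratio cutoff, and the function name are hypothetical choices, not values taken from DRAGEN.

```python
import math


def call_variant_copy_number(n_variant, n_nonvariant, gene_copies,
                             error_rate=0.01, max_log_ratio=math.log(10)):
    """Return (most_likely_copy_number, alternates).

    Each candidate k in 0..gene_copies implies an expected variant read
    fraction k / gene_copies, adjusted for a symmetric sequencing error
    rate (assumption). gene_copies must be at least 1.
    """
    scores = {}
    for k in range(gene_copies + 1):
        fraction = k / gene_copies
        # Probability that a read shows the variant allele, with errors.
        p = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        # Binomial log-likelihood without the constant combinatorial term.
        scores[k] = n_variant * math.log(p) + n_nonvariant * math.log(1 - p)
    best = max(scores, key=scores.get)
    # Alternates: candidates within the log-likelihood cutoff of the best.
    alternates = sorted(
        (k for k in scores
         if k != best and scores[best] - scores[k] <= max_log_ratio),
        key=lambda k: -scores[k],
    )
    return best, alternates


# Hypothetical counts for a sample with six CYP2B6 copies.
print(call_variant_copy_number(8, 22, 6))
```

With more gene copies the expected allele fractions of neighboring copy numbers lie closer together, so the same read counts separate them less clearly. That is why the sketch returns alternates in addition to the single most likely value.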

Recombinant Variant Calling

The recombinant (gene conversion) variant 18053A>G is detected by phasing the variant site with five flanking differentiating sites. When the haplotypes formed from phasing these sites support the gene conversion in CYP2B6, a read depth analysis is performed at the gene conversion breakpoints (transitions from either CYP2B6->CYP2B7 or CYP2B7->CYP2B6). If the posterior probability that there is at least one gene conversion variant is above 0.7, DRAGEN uses the variant for star allele identification.

Star Allele Identification

The called SVs and small variant genotypes are matched against the definitions of 39 different star alleles. More than one set of star alleles can match the called variant genotypes. For example, *1, *6 and *4, *49 both match because both sets of star alleles contain the same two small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely ones, these alternate copy numbers are also matched against the star allele definitions, which can identify additional sets of star alleles. In every case, the number of matched star alleles must equal the number of CYP2B6-derived gene copies determined in the previous steps. If the variant genotypes cannot be matched to any set of star alleles, the CYP2B6 Caller returns a no call during the genotyping step with filter value No_call.

Genotyping

Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that together contain all star alleles in the set. The likelihood of each genotype is taken from a table of population frequencies derived from the 1000 Genomes Project, and the genotype with the highest population frequency is selected. When two or more possible genotypes have similar population frequencies, all of them are emitted, and the call has filter value More_than_one_possible_genotype.
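
The selection rule can be sketched as follows. The source does not define how close two frequencies must be to count as similar, so the similarity_ratio parameter is a hypothetical stand-in, and the frequencies in the example are placeholders rather than real population data.

```python
def select_genotypes(candidates, population_frequency, similarity_ratio=0.9):
    """Return (genotypes, filter_value) for a list of candidate genotypes.

    population_frequency maps a genotype string to its frequency.
    Assumption: a genotype counts as similar to the top one when its
    frequency is at least similarity_ratio times the top frequency.
    """
    if not candidates:
        return [], "No_call"
    ranked = sorted(candidates,
                    key=lambda g: population_frequency.get(g, 0.0),
                    reverse=True)
    top = population_frequency.get(ranked[0], 0.0)
    kept = [g for g in ranked
            if population_frequency.get(g, 0.0) >= similarity_ratio * top]
    if len(kept) == 1:
        return kept, "PASS"
    return kept, "More_than_one_possible_genotype"


# Placeholder frequencies, for illustration only.
frequencies = {"*1/*6": 0.2, "*4/*49": 0.001}
print(select_genotypes(["*4/*49", "*1/*6"], frequencies))
```

Here one candidate is far more frequent than the other, so only that genotype is kept and the filter value is PASS.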

CYP2B6 Output File

The CYP2B6 Caller writes its calls to the targeted callers output file, <prefix>.targeted.json, which also aggregates calls from other targeted callers. The following example shows this file with the CYP2B6 Caller enabled:

{
    "dragenVersion": "4.2.0-724-gb600fcef",
    "sample": "NA19374",
    "pharmcatMetabolismStatusResourceUrl": "https://github.com/PharmGKB/PharmCAT/blob/aeecfe5f787e95dfb31ede62884e287affef45b3/src/main/resources/org/pharmgkb/pharmcat/definition/gene_phenotypes.json",
    "cyp2b6": {
        "genotype": "*17/*2",
        "genotypeFilter": "PASS",
        "pharmcatDescription": "An individual carrying two normal function alleles",
        "pharmcatMetabolismStatus": "Normal Metabolizer"
    }
}

For the CYP2B6 Caller, the fields are defined as follows:

Fields in JSON	Explanation	Type and Possible Values
dragenVersion	Version of DRAGEN	string
sample	Sample ID	string
pharmcatMetabolismStatusResourceUrl	A URL containing the genotype to PharmCAT mapping information	string (web link)
cyp2b6	A JSON object containing the CYP2B6 call for this sample	JSON object
genotype	Star allele genotype identified for the sample	string
genotypeFilter	The filter status for the genotype call	string (The value can include: PASS, No_call, or More_than_one_possible_genotype)
pharmcatDescription	The description corresponding to the genotype, mapped from PharmCAT	string
pharmcatMetabolismStatus	The metabolism status corresponding to the genotype, mapped from PharmCAT	string
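
The fields above can be read with any JSON parser. The following minimal sketch uses the Python standard library and an inline string that mirrors the example (the resource URL field is omitted for brevity). The helper name read_cyp2b6_call is hypothetical.

```python
import json


def read_cyp2b6_call(targeted_json_text):
    """Extract the CYP2B6 call from the text of a targeted.json file.

    Returns None when the file has no "cyp2b6" entry, for example when
    the CYP2B6 Caller was not enabled.
    """
    data = json.loads(targeted_json_text)
    call = data.get("cyp2b6")
    if call is None:
        return None
    return {
        "sample": data.get("sample"),
        "genotype": call.get("genotype"),
        "filter": call.get("genotypeFilter"),
        "metabolism_status": call.get("pharmcatMetabolismStatus"),
    }


example = """
{
    "dragenVersion": "4.2.0-724-gb600fcef",
    "sample": "NA19374",
    "cyp2b6": {
        "genotype": "*17/*2",
        "genotypeFilter": "PASS",
        "pharmcatDescription": "An individual carrying two normal function alleles",
        "pharmcatMetabolismStatus": "Normal Metabolizer"
    }
}
"""
print(read_cyp2b6_call(example))
```

Downstream code would typically check the filter field before using the genotype, because No_call and More_than_one_possible_genotype calls need separate handling.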

When the option --targeted-enable-legacy-output=true is set, the CYP2B6 Caller also generates a <output-file-prefix>.cyp2b6.tsv file in the output directory. The file has no header line and contains a single line with the following tab-delimited fields:

• Sample name
• One or more semicolon-delimited CYP2B6 genotypes or None for no call
• The filter status. The value can include: PASS, No_call, or More_than_one_possible_genotype

Each CYP2B6 genotype contains two haplotypes separated by a slash (for example, *1/*2). Each haplotype consists of one or more star alleles separated by a plus sign (for example, *10+*36). When a haplotype contains more than one copy of the same star allele, that star allele appears only once and is followed by a multiplication sign and the number of copies (for example, *1x2 for two copies of *1).
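
The format described above can be parsed with a few string splits. The sketch below assumes the multiplication sign is the ASCII letter x, as in the *1x2 example, and that star allele names never contain that letter. The function names and the sample line are hypothetical.

```python
def parse_haplotype(haplotype):
    """Map each star allele in a haplotype to its copy count."""
    alleles = {}
    for token in haplotype.split("+"):
        # "*1x2" splits into name "*1" and count "2"; "*10" has no count.
        name, _, count = token.partition("x")
        alleles[name] = alleles.get(name, 0) + (int(count) if count else 1)
    return alleles


def parse_cyp2b6_tsv_line(line):
    """Parse the single line of a <output-file-prefix>.cyp2b6.tsv file.

    Returns (sample, genotypes, filter_status), where genotypes is a list
    of (haplotype_1, haplotype_2) tuples, or an empty list for a no call.
    """
    sample, genotype_field, filter_status = line.rstrip("\n").split("\t")
    if genotype_field == "None":
        return sample, [], filter_status
    genotypes = [
        tuple(parse_haplotype(h) for h in genotype.split("/"))
        for genotype in genotype_field.split(";")
    ]
    return sample, genotypes, filter_status


# Hypothetical line, for illustration only.
print(parse_cyp2b6_tsv_line("NA12878\t*1x2/*6\tPASS"))
```

Representing each haplotype as a mapping from star allele to copy count keeps duplicated alleles such as *1x2 explicit without repeating the allele name.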

Command-line Examples

To enable the CYP2B6 Caller, use --enable-cyp2b6=true. The CYP2B6 Caller is disabled by default. The CYP2B6 Caller can run directly from FASTQ input with the mapper or from prealigned BAM/CRAM input. You can also enable the CYP2B6 Caller in parallel with any other germline variant callers as part of a WGS germline analysis workflow. For more information on other variant callers, see DNA Pipeline for DRAGEN.

FASTQ Input

The following command-line example uses FASTQ input:

dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/8 \
--fastq-file1 /staging/test/data/NA12878_R1.fastq \
--fastq-file2 /staging/test/data/NA12878_R2.fastq \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--RGID DRAGEN_RGID \
--RGSM NA12878 \
--enable-map-align=true \
--enable-cyp2b6=true

Prealigned BAM Input

The following command-line example uses BAM input that has already been aligned:

dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/8 \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align=false \
--enable-cyp2b6=true