Gene Fusion Detection

The DRAGEN Gene Fusion module uses the DRAGEN RNA spliced aligner for detection of gene fusion events. It performs a split-read analysis on the supplementary (chimeric) alignments to detect potential breakpoints. The putative fusion events then go through various filtering stages to mitigate potential false positives. In addition to the final results, all potential candidates (unfiltered) are output, which can be used to maximize sensitivity.

Running DRAGEN Gene Fusion

You can run the DRAGEN Gene Fusion module together with a regular RNA-Seq map/align job. To enable the DRAGEN Gene Fusion module, set --enable-rna-gene-fusion to true in your current RNA-Seq command-line scripts. The DRAGEN Gene Fusion module requires a gene annotations file in GTF or GFF format.

The following is an example command line for running an end-to-end RNA-Seq experiment.

/opt/edico/bin/dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
-a <GTF_FILE> \
--output-dir <OUT_DIRECTORY> \
--output-file-prefix <PREFIX> \
--RGID <READ_GROUP_ID> \
--RGSM <SAMPLE_NAME> \
--enable-rna true \
--enable-rna-gene-fusion true
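
When the command is launched from a pipeline script, it can help to assemble the arguments programmatically. The following is a minimal Python sketch that uses only the options shown in the example command. The function name build_dragen_command and all file paths are hypothetical placeholders, and the sketch only prints the command rather than running it.

```python
import shlex


def build_dragen_command(hashtable, fastq1, fastq2, gtf, out_dir, prefix,
                         rgid, rgsm, dragen_bin="/opt/edico/bin/dragen"):
    """Return the argument list for an RNA-Seq run with gene fusion detection.

    Only options that appear in the example command are used here.
    """
    return [
        dragen_bin,
        "-r", hashtable,
        "-1", fastq1,
        "-2", fastq2,
        "-a", gtf,
        "--output-dir", out_dir,
        "--output-file-prefix", prefix,
        "--RGID", rgid,
        "--RGSM", rgsm,
        "--enable-rna", "true",
        "--enable-rna-gene-fusion", "true",
    ]


# Hypothetical paths and identifiers, for illustration only.
cmd = build_dragen_command(
    hashtable="/refs/rna_hashtable",
    fastq1="/data/sample_R1.fastq.gz",
    fastq2="/data/sample_R2.fastq.gz",
    gtf="/refs/ref_annot.gtf",
    out_dir="/results/sample",
    prefix="sample",
    rgid="rg1",
    rgsm="sample",
)

# Print a shell-quoted form for review; to launch it, pass the list to
# subprocess.run(cmd, check=True) on a system where DRAGEN is installed.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.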

At the end of a run, a summary of detected gene fusion events is output, which is similar to the following example.

==================================================================
Loading gene annotations file
==================================================================
Input annotations file: ref_annot.gtf
Number of genes: 27459
Number of transcripts: 196520
Number of exons: 1196293
==================================================================
Launching DRAGEN Gene Fusion Detection
==================================================================
Min nonintact split support 3
rna-gf-blast-pairs: blast_pairs.outfmt6
rna-gf-min-blast-pairs-eval: 1e-100
rna-gf-exon-snap: 50
rna-gf-coverage-lookup-window: 1000
rna-gf-min-support: 2
rna-gf-min-support-be: 10
rna-gf-min-breakpoint-mapq: 20
rna-gf-min-unique-alignments: 2
rna-gf-restrict-genes true
==================================================================
Completed DRAGEN Gene Fusion Detection
==================================================================
Chimeric alignments: 683343
Total fusion candidates: 2370
Final fusion candidates: 26
RUN TIME Time loading reference 00:00:37.696 37.70
RUN TIME Time loading anchor reference 00:00:36.720 36.72
RUN TIME Time loading gene annotations 00:00:04.784 4.78
RUN TIME Time aligning reads 00:00:57.370 57.37
RUN TIME Time aligning anchored reads 00:00:21.931 21.93
RUN TIME Time merging anchored reads 00:00:21.546 21.55
RUN TIME Time duplicate marking 00:00:05.812 5.81
RUN TIME Time sorting and marking duplicates 00:01:00.949 60.95
RUN TIME Time saving map/align output 00:01:31.879 91.88
RUN TIME Time running gene fusion event generation 00:00:00.943 0.94
RUN TIME Time running gene fusion filter 00:00:00.612 0.61
RUN TIME Time partitioning 00:02:00.800 120.80
RUN TIME Total runtime 00:04:38.725 278.73
***********************************************************
DRAGEN finished normally
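
The candidate counts in this summary are convenient for run-level QC. The following Python sketch extracts them from captured log text. It assumes the three count lines appear exactly as in the example above; the function name parse_fusion_summary is hypothetical, and the log wording may differ between DRAGEN versions.

```python
import re

# Labels copied from the example summary above.
SUMMARY_KEYS = (
    "Chimeric alignments",
    "Total fusion candidates",
    "Final fusion candidates",
)


def parse_fusion_summary(log_text):
    """Return a dict of the candidate counts found in the log text.

    Labels that are not present in the text are simply left out.
    """
    counts = {}
    for key in SUMMARY_KEYS:
        pattern = rf"^\s*{re.escape(key)}:\s*(\d+)\s*$"
        match = re.search(pattern, log_text, flags=re.MULTILINE)
        if match:
            counts[key] = int(match.group(1))
    return counts


sample_log = """\
Chimeric alignments: 683343
Total fusion candidates: 2370
Final fusion candidates: 26
"""

summary = parse_fusion_summary(sample_log)
print(summary)
```

In the example run, 26 of the 2370 total candidates remain after filtering.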

Gene Fusion Output and Filters

The <outputPrefix>fusion_candidates.features.csv file lists the detected gene fusion events. The output CSV file includes the following columns. Any additional columns describe additional features of the fusion candidates.

• #FusionGene: Parent gene names (in 5' to 3' order of the transcript) participating in the fusion. If a fusion breakpoint overlaps multiple genes, the genes with passing candidates (or the genes with failing candidates, if no passing candidate exists) are listed by default as separate candidates (rows). To show them as a semicolon-separated gene list on the same row, set the option --rna-gf-merge-calls to true, as described in the Options section.

• Score: Fusion call confidence score based on the number of supporting split reads and read pairs as well as other fusion features. The score ranges from 0 (low confidence) to 1 (high confidence).

• LeftBreakpoint: Gene 1 breakpoint formatted as <Chromosome>:<Position>:<Strand>.

• RightBreakpoint: Gene 2 breakpoint formatted as <Chromosome>:<Position>:<Strand>.

• Filter: Semicolon-separated list of filters. Each listed filter is either a Confidence or an Information Only filter. The Filter value is PASS if none of the confidence filters are triggered. Otherwise, the output value is FAIL.
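
The columns above can be read with a few lines of Python. The following is a minimal sketch, not a definitive parser. It assumes the file has a header row containing the column names listed above, that the delimiter is a comma (pass a different delimiter if your output differs), and that PASS appears as one token of the semicolon-separated Filter value. The function names and the example rows are made up for illustration and are not real DRAGEN output.

```python
import csv
import io


def parse_breakpoint(text):
    """Split '<Chromosome>:<Position>:<Strand>' into (chrom, position, strand).

    Splitting from the right keeps chromosome names that contain ':' intact.
    """
    chrom, position, strand = text.rsplit(":", 2)
    return chrom, int(position), strand


def passing_fusions(handle, delimiter=","):
    """Yield a summary dict for every row whose Filter value includes PASS."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        filters = row["Filter"].split(";")
        if "PASS" in filters:
            yield {
                "genes": row["#FusionGene"],
                "score": float(row["Score"]),
                "left": parse_breakpoint(row["LeftBreakpoint"]),
                "right": parse_breakpoint(row["RightBreakpoint"]),
                "filters": filters,
            }


# Made-up rows; the gene names, values, and Filter layout are assumptions.
example = io.StringIO(
    "#FusionGene,Score,LeftBreakpoint,RightBreakpoint,Filter\n"
    "GENE_A--GENE_B,0.91,chr1:1000:+,chr2:2000:-,PASS\n"
    "GENE_C--GENE_D,0.12,chr3:3000:+,chr3:3500:+,LOW_SCORE;READ_THROUGH\n"
)

calls = list(passing_fusions(example))
for call in calls:
    print(call["genes"], call["score"], call["left"], call["right"])
```

To use it on a real run, open the features CSV with open(path, newline="") and pass the file handle in place of the in-memory example.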

The following are the available filters.

Filter	Type	Description
DOUBLE_BROKEN_EXON	Confidence	If both breakpoints are 50 bp from annotated exon boundaries, then the number of supporting reads does not satisfy a high threshold requirement (≥ 10 supporting reads).
LOW_MAPQ	Confidence	All fusion supporting read alignments at either of the breakpoints have MAPQ < 20.
LOW_UNIQUE_ALIGNMENTS	Confidence	All fusion supporting read alignments near at least one of the two breakpoints have the same start and end position.
LOW_SCORE	Confidence	The fusion candidate has low probabilistic score (< 0.5) as determined by the features of the candidate.
MIN_SUPPORT	Confidence	The fusion candidate has < 2 fusion supporting read pairs.
UNENRICHED_GENES	Confidence	If an enrichment list is provided, then neither of the two parent genes is enriched. If amplicon mode is enabled, then at least one of the two parent genes is not enriched. See DRAGEN Amplicon Pipeline for further information.
READ_THROUGH	Confidence	The breakpoints are cis neighbors (< 200,000 bp) on the reference genome.
MITOCHONDRIAL_GENES	Confidence	The fusion candidate involves mitochondrial genes. Set --rna-gf-filter-chrm false to disable this filter.
ANCHOR_SUPPORT	Information only	Read alignments of fusion supporting reads are short (less than 12 bp) at either of the two breakpoints.
HOMOLOGOUS	Information only	The candidate is likely a false candidate generated because the two genes involved have high gene homology.
LOW_ALT_TO_REF	Information only	The number of fusion supporting reads is < 1% of the number of reads supporting the reference transcript at either of the two breakpoints.
LOW_GENE_COVERAGE	Information only	Either of the two breakpoints has less than 125 bp with nonzero read coverage.
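
The PASS/FAIL rule can be sketched in Python as follows. The filter names and their types are taken from the table above. The function name call_status is hypothetical, and rejecting unknown filter names is a design choice of this sketch rather than documented DRAGEN behavior.

```python
# Filter names and types as listed in the table above.
CONFIDENCE_FILTERS = {
    "DOUBLE_BROKEN_EXON",
    "LOW_MAPQ",
    "LOW_UNIQUE_ALIGNMENTS",
    "LOW_SCORE",
    "MIN_SUPPORT",
    "UNENRICHED_GENES",
    "READ_THROUGH",
    "MITOCHONDRIAL_GENES",
}
INFORMATION_ONLY_FILTERS = {
    "ANCHOR_SUPPORT",
    "HOMOLOGOUS",
    "LOW_ALT_TO_REF",
    "LOW_GENE_COVERAGE",
}


def call_status(triggered):
    """Return 'FAIL' if any confidence filter is triggered, otherwise 'PASS'.

    Information-only filters never change the result.
    """
    triggered = set(triggered)
    unknown = triggered - CONFIDENCE_FILTERS - INFORMATION_ONLY_FILTERS
    if unknown:
        raise ValueError(f"Unknown filter name(s): {sorted(unknown)}")
    return "FAIL" if triggered & CONFIDENCE_FILTERS else "PASS"


print(call_status(["ANCHOR_SUPPORT", "HOMOLOGOUS"]))  # information only
print(call_status(["LOW_MAPQ", "HOMOLOGOUS"]))        # one confidence filter
```

A candidate flagged only by information-only filters is still a passing call, so those flags are best treated as review hints.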

A logistic regression model that has been trained on a large set of RNA data is used to score each gene fusion event. The remaining columns in this file provide the values of the features that are either used in this logistic regression model or used for further filtering of the events to determine a PASS/FAIL result. Each feature is detailed in the following table along with either the associated logistic regression coefficient or the associated filter. Some features also have explanatory notes.

Specific features and column values are subject to change in future DRAGEN versions as more RNA data is analyzed.

Feature	Coefficient	Default Value	Filter Use	Explanatory Notes
SplitScore	LogSplitScore	1.962		Combined count of fusion supporting fragments reported as split reads and soft-clipped reads
NumSplitReads	LogNumSplitReads	0.0		Fusion supporting fragments with at least 1 split read alignment. Not used in model since we use SplitScore
NumSoftClippedReads				Fusion supporting fragments with no split read alignment, but at least 1 soft clipped alignment. Included in SplitScore and includes soft-clipped reads for both Gene1 and Gene2
NumSoftClippedReadsGene1				Fusion supporting fragments with no split read alignment, but at least 1 soft clipped alignment to Gene 1 (informational)
NumSoftClippedReadsGene2				See above (NumSoftClippedReadsGene1) for Gene 2
NumPairedReads	LogNumPairedReads	6.989		Fusion supporting fragments such that the 2 reads map fully to Gene1 and Gene2, but no read overlaps the breakpoint
NumRefSplitReadsGene1				Fragments which map fully within Gene 1 such that at least 1 read aligns across the BP (accumulated as Ref reads)
NumRefPairedReadsGene1				Fragments which map fully within Gene 1 such that the 2 reads map fully on the opposite sides of the breakpoint (accumulated as Ref reads)
NumRefSplitReadsGene2				See above (NumRefSplitReadsGene1) for Gene 2
NumRefPairedReadsGene2				See above (NumRefPairedReadsGene1) for Gene 2
AltToRef	LogAltToRef	2.424	LOW_ALT_TO_REF	Ratio of (fusion split + soft-clipped reads) / max(NumRefSplitReadsGene1, NumRefSplitReadsGene2)
UniqueAlignmentsGene1			LOW_UNIQUE_ALIGNMENTS	Unique (start-end) positions of fusion supporting read alignments to Gene 1 (after dedup)
UniqueAlignmentsGene2			LOW_UNIQUE_ALIGNMENTS	Unique (start-end) positions of fusion supporting read alignments to Gene 2 (after dedup)
MaxMapqGene1			LOW_MAPQ	Maximum MAPQ for reads in Gene 1
MaxMapqGene2			LOW_MAPQ	Maximum MAPQ for reads in Gene 2
CoverageBasesGene1	LogCoverageBases	0.492		Bases in Gene 1 with depth of coverage greater than a threshold (>=1) within a certain distance (size 1000bp) of the breakpoint in the direction of the breakpoint strand which is part of the fusion transcript
CoverageBasesGene2	LogCoverageBases	0.492		See above (CoverageBasesGene1) for Gene 2
DeltaExonBoundaryGene1	LogDeltaExonBoundary	1.026		Distance from the Gene 1 breakpoint for the closest fusion supporting alignment (higher distance to boundary lowers score)
DeltaExonBoundaryGene2	LogDeltaExonBoundary	1.026		See above (DeltaExonBoundaryGene1) for Gene 2
IsRestrictedGene1	IsRestricted	9.380		Indicator variable of whether the Gene 1 is tagged as protein coding or lincRNA in the GTF
IsRestrictedGene2	IsRestricted	9.380		Indicator variable of whether the Gene 2 is tagged as protein coding or lincRNA in the GTF
IsEnrichedGene1				If enrichment or amplicon assay, then indicates whether Gene 1 is enriched. If whole transcriptome sequencing, then set to 1 (used in fusion length and coverage calculations)
IsEnrichedGene2				See above (IsEnrichedGene1) for Gene 2
CisDistance			READ_THROUGH	Distance between breakpoints if they are adjacent to each other and on the same strand. Large value (100M) if not a CIS break
BreakpointDistance				Distance between breakpoints if they are adjacent. Large value (100M) if not within same chromosome
GenePairHomologyEval	LogGenePairHomologyEval	0.108		E-value of pairwise BLAST alignment of the parent genes
AnchorLength1	AnchorLength	0.032		Longest alignment of a fusion supporting read to Gene 1
AnchorLength2	AnchorLength	0.032		Longest alignment of a fusion supporting read to Gene 2
FusionLengthGene1				Distance from breakpoint to the end of Gene 1
FusionLengthGene2				Distance from breakpoint to the end of Gene 2
NonFusionLengthGene1				BP distance to the end of transcript not part of the fusion (Informative)
NonFusionLengthGene2				BP distance to the end of transcript not part of the fusion (Informative)
AdditionalGenes1				Additional genes that overlap Gene 1 breakpoint but did not result in a passing fusion call. Column is only reported if fusion candidate merging is enabled.
AdditionalGenes2				Additional genes that overlap Gene 2 breakpoint but did not result in a passing fusion call. Column is only reported if fusion candidate merging is enabled.
Gene1Id				Gene ID reported in the GTF annotation file
Gene2Id				Gene ID reported in the GTF annotation file
Gene1Location				IntactExon: Breakpoint matches exon boundary, BrokenExon: Breakpoint is within an exon but does not match the exon boundary, Intron: Breakpoint is within an intron, Intergenic: Breakpoint does not overlap any gene
Gene2Location				See above (Gene1Location) for Gene 2
Gene1Sense				True if the Gene 1 5' to 3' direction matches the BP order, indicating that the gene is the upstream gene in the fusion transcript (informative)
Gene2Sense				See above (Gene1Sense) for Gene 2
SvEvent				If SV VCF is provided, then semi-colon separated string representation of SV events matching the fusion candidate.
SvType				If SV VCF is provided, then semi-colon separated list of type of each matching SvEvent.
SomaticScore				If SV VCF is provided, then highest SomaticScore value for matching SvEvents.
SvDistance				If SV VCF is provided, then maximum distance between SV breakpoints and fusion breakpoints (if multiple matching SV events, then minimum over all SV Events).
LeftSvDistance				If SV VCF is provided, then distance between left fusion breakpoint and corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV Events).
RightSvDistance				If SV VCF is provided, then distance between right fusion breakpoint and corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV Events).
SvPresent	SvPresent	0.0	0.377	If SV VCF is provided, then 1 if matching SV event is present, else 0.
SvAbsent				If SV VCF is provided, then 1 if no matching SV event is present, else if no SV VCF provided or if matching SV event is provided, then 0.
	Intercept	-8.467		Intercept for logistic regression model.
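
The general form of a logistic regression score is sketched below in Python. This only illustrates how an intercept and coefficients combine into a value between 0 and 1; it is not the DRAGEN implementation and is not expected to reproduce the Score column. The intercept and the three coefficients are copied from the table above, but the sketch uses only a subset of the model terms, and the exact feature transforms (for example, which logarithm the Log-prefixed terms use) are not stated in this section, so the inputs are assumed to be already transformed. The function name logistic_score is hypothetical.

```python
import math

# Values copied from the table above; only a subset of the model terms.
INTERCEPT = -8.467
COEFFICIENTS = {
    "LogSplitScore": 1.962,
    "LogNumPairedReads": 6.989,
    "LogAltToRef": 2.424,
}

# LOW_SCORE threshold from the filter table.
LOW_SCORE_THRESHOLD = 0.5


def logistic_score(transformed_features, coefficients=COEFFICIENTS,
                   intercept=INTERCEPT):
    """Return sigmoid(intercept + sum(coefficient * feature)).

    transformed_features maps coefficient names to already-transformed
    feature values (assumption: the transform is applied by the caller).
    """
    z = intercept + sum(
        coefficients[name] * value
        for name, value in transformed_features.items()
    )
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


score = logistic_score({"LogSplitScore": 5.0})
print(round(score, 3), "LOW_SCORE" if score < LOW_SCORE_THRESHOLD else "ok")
```

With no supporting evidence (all terms zero), the intercept alone gives a score close to 0, which is why a candidate needs several favorable features to pass the 0.5 threshold.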

Gene Fusion Options and Filters

The following options can be used to configure the fusion caller:

Option	Description
--rna-gf-enriched-genes	For RNA enrichment assays, a list of targeted genes specified as one gene-name per line. Only fusion calls involving at least one gene on the list are reported.
--rna-gf-blast-pairs	A file listing gene pairs that have a high level of similarity. This list of gene pairs is used as a homology filter to reduce false positives. For information on generating this file, visit the Fusion Filter GitHub page. Use the ref_annot.cdsplus.fa.allvsall.outfmt6.genesym.gz file produced by CTAT. For runs on human genome assemblies GRCh38 and hg19, DRAGEN automatically applies a default file generated using Gencode version 32 annotations for primary chromosomes if no other file is specified on the command line.
--rna-repeat-intervals	BED file that contains a target list of repeat intervals for sensitive fusion detection. Mutually exclusive with --rna-repeat-genes. This option overrides the default files, which contain the genes CIC, DUX4, PSPH, and SEPTIN14 for GRCh38 and hg19 reference genomes.
--rna-repeat-genes	Text file that contains the names or IDs (from annotation GTF file) of targeted repetitive genes for sensitive fusion detection. Mutually exclusive with --rna-repeat-intervals. This option overrides the default BED file.
--enable-variant-annotation --variant-annotation-assembly --variant-annotation-data	Enable Illumina Annotation Engine (IAE) to report fusion annotations in JSON format. --enable-variant-annotation must be set to true. For more information, see Illumina Annotation Engine.
--rna-gf-restrict-genes	When parsing the gene annotations file (GTF/GFF) for use in the DRAGEN Gene Fusion module, you can use this option to restrict the entries of interest to protein-coding and lincRNA genes. Restricting the annotations in this way reduces false positive rates in currently studied fusion events. The default value is true.
--rna-gf-merge-calls	If multiple genes overlap a fusion breakpoint, DRAGEN generates and scores a separate fusion candidate for each gene pair overlapping the breakpoint. When this option is true, DRAGEN merges candidates that share the same breakpoints into a single row that reports the feature values for the highest scoring passing candidate (or the highest scoring failing candidate if no passing candidate is reported). For each breakpoint, the #FusionGene column reports a semicolon-separated list of the names of all overlapping genes with a passing candidate. If a mix of passing and failing candidates is reported for the same breakpoint pair, genes with only failing candidates are listed in the columns AdditionalGenes1 and AdditionalGenes2. If no passing candidate exists, then all overlapping genes are reported in the #FusionGene column. The default value is false, so that each reported fusion event has only one left and one right gene, and overlapping genes are output as separate events.
--enable-rna-amplicon	A separate fusion filtering model is trained for RNA amplicon mode. Duplicate removal for fusion supporting reads is disabled for RNA amplicon mode. Both genes are required to be in the list of enriched genes. By default, the DRAGEN fusion caller filters candidates if a transcript overlaps both of the breakpoints (e.g. fusions such as FIP1L1--PDGFRA and GOPC--ROS1). In RNA amplicon mode, such candidates are not filtered. See DRAGEN Amplicon Pipeline for further information. The default is false.
--rna-gf-sv-vcf	Structural variant VCF file output from the DRAGEN DNA structural variant caller run in tumor mode. DRAGEN reports SV events matching each fusion candidate and adjusts the score based on the presence or absence of matching SVs.
--rna-gf-filter-chrm	DRAGEN filters fusion candidates involving chrM/MT with filter MITOCHONDRIAL_GENES if it can autodetect this chromosome. To disable filtering fusions involving mitochondrial genes, set to false. Default is true.
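
When --rna-gf-merge-calls is true, the #FusionGene value can hold several gene names per breakpoint. The following Python sketch splits such a value into per-breakpoint gene lists. It assumes that the two sides of the fusion are joined by "--", as in the fusion names FIP1L1--PDGFRA and GOPC--ROS1 above, and that each side is a semicolon-separated list; the exact layout of merged values is an assumption, the function name split_fusion_genes is hypothetical, and GENE_A, GENE_B, and GENE_C are made-up names.

```python
def split_fusion_genes(fusion_gene, pair_separator="--"):
    """Split a #FusionGene value into (left_genes, right_genes) lists.

    Assumes '<left>--<right>', where each side may be a semicolon-separated
    list of gene names when merged calls are enabled.
    """
    left, right = fusion_gene.split(pair_separator, 1)
    return left.split(";"), right.split(";")


# Default output: one gene per side.
print(split_fusion_genes("FIP1L1--PDGFRA"))

# Merged output (made-up names): two genes overlap the left breakpoint.
print(split_fusion_genes("GENE_A;GENE_B--GENE_C"))
```

Splitting on the double hyphen rather than a single hyphen keeps gene names that themselves contain a hyphen intact.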