Gene Expression Quantification

The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA-seq data set. The module first translates the genomic mapping of each read (or read pair) into the corresponding transcript mappings. It then uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model and correct for GC bias in the reported quantification results.
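To make the EM step more concrete, the following Python sketch shows how a basic EM loop can apportion ambiguously mapped reads among transcripts. It is a textbook-style illustration and not DRAGEN's implementation: the function name `em_abundance`, the input layout, the fixed iteration count, and the absence of any GC-bias model are all assumptions made for the example.

```python
def em_abundance(compat, eff_len, n_iter=200):
    """Estimate expected read counts per transcript with a basic EM loop.

    compat  -- list of reads; each read is a list of the transcript
               indices it is compatible with (hypothetical input layout)
    eff_len -- effective length of each transcript
    """
    n_tx = len(eff_len)
    if n_tx == 0:
        return []
    # Start from a uniform guess of the fraction of reads per transcript.
    theta = [1.0 / n_tx] * n_tx
    counts = [0.0] * n_tx
    for _ in range(n_iter):
        counts = [0.0] * n_tx
        # E-step: split each read among its compatible transcripts in
        # proportion to read fraction divided by effective length.
        for txs in compat:
            weights = [theta[t] / eff_len[t] for t in txs]
            total = sum(weights)
            if total == 0.0:
                continue
            for t, w in zip(txs, weights):
                counts[t] += w / total
        # M-step: the new read fractions are the normalized expected counts.
        assigned = sum(counts)
        if assigned == 0.0:
            break
        theta = [c / assigned for c in counts]
    return counts
```

For example, with two transcripts of equal effective length, three reads unique to the first, one read unique to the second, and four reads compatible with both, the loop moves the shared reads toward the transcript with more unique evidence and settles near 6 and 2 expected reads.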

To enable the quantification module, set the --enable-rna-quantification option to true in your current RNA-seq command-line scripts. Additionally, you must provide a gene annotation file (GTF/GFF) that contains the genomic position of all transcripts to quantify. You can specify the GTF/GFF file using the -a or --annotation-file option.

Quantification Options

Option	Description
--enable-rna-quantification	If set to true, enables RNA quantification. Requires --enable-rna to be set to true.
--rna-quantification-library-type	Specifies the type of RNA-seq library. The following are the available values:
• IU: Paired-end unstranded library.
• ISR: Paired-end stranded library in which read2 matches the transcript strand (for example, Illumina Stranded Total RNA Prep).
• ISF: Paired-end stranded library in which read1 matches the transcript strand.
• U: Single-end unstranded library.
• SR: Single-end stranded library in which reads are in reverse orientation to the transcript strand (for example, Illumina Stranded Total RNA Prep).
• SF: Single-end stranded library in which reads match the transcript strand.
• A: DRAGEN examines the first read pairs in the data set to automatically detect the correct library type. Autodetect is the default value.
--rna-quantification-gc-bias	Controls GC bias correction, which estimates the effect of transcript %GC on sequencing coverage and accounts for that effect when estimating expression. To disable GC bias correction, set to false.
--rna-quantification-fld-max, --rna-quantification-fld-mean, --rna-quantification-fld-sd	Use these options to specify the insert size distribution of the RNA-seq library for single-end runs. These options are relevant for GC bias correction. The defaults are 250 ± 25 (mean ± standard deviation). The maximum allowed value is 1000. To improve accuracy, modify the values to match your library.
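As an illustration of how the options above fit together, the following Python sketch assembles only the quantification-related arguments documented in this section. The helper `quantification_args` is hypothetical, the space-separated `--option value` form is an assumption, and applying the limit of 1000 to all three insert-size options is one reading of the table. Reference, input, and output options are not covered here and must be supplied separately.

```python
# Library types listed in the options table of this section.
LIBRARY_TYPES = {"IU", "ISR", "ISF", "U", "SR", "SF", "A"}


def quantification_args(annotation_file, library_type="A", gc_bias=True,
                        fld_mean=None, fld_sd=None, fld_max=None):
    """Return the quantification-related options as a list of strings.

    Hypothetical helper: it only assembles options documented in this
    section and does not run DRAGEN.
    """
    if library_type not in LIBRARY_TYPES:
        raise ValueError(f"unknown library type: {library_type!r}")
    args = [
        "--enable-rna", "true",
        "--enable-rna-quantification", "true",
        "--annotation-file", annotation_file,
        "--rna-quantification-library-type", library_type,
        "--rna-quantification-gc-bias", "true" if gc_bias else "false",
    ]
    for name, value in (("mean", fld_mean), ("sd", fld_sd), ("max", fld_max)):
        if value is None:
            continue  # leave the documented default in place
        if value > 1000:
            # Assumption: the documented limit applies to each value.
            raise ValueError(f"fld-{name} exceeds the documented maximum of 1000")
        args += [f"--rna-quantification-fld-{name}", str(value)]
    return args
```

Validating the library type before launching a run catches typos such as `IRS` early, rather than after the job has started.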

Quantification Outputs

Transcript quantification results are reported in the <outputPrefix>.quant.sf text file. The file lists results for each transcript. You can use the output file as input for differential gene expression using tools such as tximport and DESeq2.

The following is an example of the file contents:

Name	Length	EffectiveLength	TPM	NumReads
ENST00000364415.1	116	12.3238	5.2328	1
ENST00000564138.1	2775	2105.58	1.28293	41.8885

Field	Description
Name	The ID of the transcript.
Length	The length of the (spliced) transcript in base pairs.
EffectiveLength	The length as accessible to RNA-seq, accounting for insert-size and edge effects.
TPM	Transcripts per Million (TPM) represents the expression of the transcript when normalized for transcript length and sequencing depth.
NumReads	The estimated number of reads from the transcript. The values are not normalized.
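The relationship between the fields above can be sketched in Python. The example assumes the standard TPM definition, in which each transcript's reads-per-effective-base rate is divided by the sum of those rates over all transcripts and scaled to one million. The parser assumes a header line followed by whitespace-separated rows, as in the example above. The function names are hypothetical, and DRAGEN's own rounding or filtering may differ.

```python
def read_quant(lines):
    """Parse quant.sf-style lines (header plus whitespace-separated rows)."""
    it = iter(lines)
    header = next(it).split()
    rows = []
    for line in it:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rec = dict(zip(header, fields))
        rows.append({
            "Name": rec["Name"],
            "EffectiveLength": float(rec["EffectiveLength"]),
            "NumReads": float(rec["NumReads"]),
        })
    return rows


def compute_tpm(rows):
    """Recompute TPM from NumReads and EffectiveLength."""
    # Reads per effective base for each transcript.
    rates = [
        r["NumReads"] / r["EffectiveLength"] if r["EffectiveLength"] > 0 else 0.0
        for r in rows
    ]
    total = sum(rates)
    if total == 0.0:
        return [0.0] * len(rows)
    # Scale so that the values sum to one million across all transcripts.
    return [1e6 * rate / total for rate in rates]
```

Because the denominator sums over every transcript in the file, TPM values can only be reproduced from a complete quant.sf file, not from a few excerpted rows.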

The gene expression quantification module also outputs the following files. For information on the metrics included, see Quantification and RNA QC Metrics.

• <outputPrefix>.quant.genes.sf: Contains quantification results at the gene level. The results are produced by summing together all transcripts with the same geneID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
• <outputPrefix>.quant.metrics.csv: Summary statistics relevant to RNA transcripts and quantification. See Quantification and RNA QC Metrics.
• <outputPrefix>.quant.transcript_fragment_lengths.txt: Full fragment length distribution of reads mapped to transcripts.
• <outputPrefix>.quant.transcript_coverage.txt: Measures coverage uniformity as the average 5' to 3' coverage pattern along transcripts.
• <outputPrefix>.SJ.saturation.txt: Measures sequencing saturation of the library, including the number of unique splice junctions observed as a function of reads processed.
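The gene-level aggregation described for the quant.genes.sf file can be sketched in Python as follows. This is an illustration under stated assumptions, not DRAGEN's code: the helper `aggregate_genes` is hypothetical, TPM is assumed to be the expression weight for the length means, and genes with zero total TPM are assumed to fall back to an unweighted mean.

```python
from collections import defaultdict


def aggregate_genes(transcripts, tx_to_gene):
    """Collapse transcript-level rows into gene-level rows.

    transcripts -- dicts with Name, Length, EffectiveLength, TPM, NumReads
    tx_to_gene  -- mapping of transcript ID to gene ID (taken from the GTF)
    """
    groups = defaultdict(list)
    for tx in transcripts:
        groups[tx_to_gene[tx["Name"]]].append(tx)

    genes = []
    for gene_id, members in groups.items():
        tpm = sum(t["TPM"] for t in members)
        if tpm > 0:
            # Assumption: weight each transcript by its share of gene TPM.
            weights = [t["TPM"] / tpm for t in members]
        else:
            # Assumption: unexpressed genes use an unweighted mean.
            weights = [1.0 / len(members)] * len(members)
        genes.append({
            "Name": gene_id,
            "Length": sum(w * t["Length"] for w, t in zip(weights, members)),
            "EffectiveLength": sum(
                w * t["EffectiveLength"] for w, t in zip(weights, members)
            ),
            "TPM": tpm,
            "NumReads": sum(t["NumReads"] for t in members),
        })
    return genes
```

Weighting the lengths by expression means the gene-level length reflects the isoforms that actually contribute reads, which is the kind of per-sample length information that tools such as tximport can take into account.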

Quantification and RNA QC Metrics

The RNA Quantification module outputs metrics related to the gene expression results and more general RNA QC metrics that rely on transcript-level analysis.

A summary of the metrics is output to the <outputPrefix>.quant.metrics.csv file.

Metric	Description
Transcripts	Number of transcripts from the gene annotation file (GTF/GFF) input used for analysis.
Assumed library orientation	Library orientation of the RNA-seq reads relative to the original transcripts. The library orientation is automatically detected, or you can provide an orientation. See Quantification Options for more information.
Transcript fragments	Number of fragments (read pairs) mapped to one or more annotated transcripts.
Median transcript CV coverage	Coefficient of variation of coverage along transcripts. The metric measures how even RNA-seq read coverage is across a transcript.
<Forward / Reverse> transcript fragments	Read pairs that match transcripts on the forward or the reverse strand.
Strand mismatched fragments	If using a stranded library type, read pairs that do not match the expected strand of the transcript.
Orientation filtered fragments	Read pairs excluded from analysis because they do not match the expected orientation of the transcript (antisense).
Ambiguous orientation fragments	Read pairs that match transcripts in both forward and reverse orientation.
Intron fragments	Read pairs that overlap with a gene, but do not overlap with an exon.
Intergenic fragments	Read pairs that do not overlap with any gene.
Unknown transcript fragments	Read pairs that overlap with an exon of a gene, but do not match any transcript (mismatched splice sites).
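To illustrate the Median transcript CV coverage metric, the following Python sketch computes a coefficient of variation (standard deviation divided by mean) of coverage for each transcript and then takes the median. The function names, the per-base depth input, the use of the population standard deviation, and the exclusion of uncovered transcripts are assumptions for the example. How DRAGEN bins coverage or selects transcripts is not specified in this section.

```python
import statistics


def coverage_cv(depths):
    """Coefficient of variation of coverage along one transcript."""
    if not depths:
        return None
    mean = statistics.fmean(depths)
    if mean == 0:
        return None  # CV is undefined for an uncovered transcript
    # Assumption: population standard deviation over all positions.
    return statistics.pstdev(depths) / mean


def median_transcript_cv(coverage_by_transcript):
    """Median of the per-transcript CV values, skipping undefined ones."""
    cvs = [
        cv
        for cv in (coverage_cv(d) for d in coverage_by_transcript.values())
        if cv is not None
    ]
    return statistics.median(cvs) if cvs else None
```

A value near 0 indicates even coverage along transcripts, while larger values indicate coverage concentrated in part of the transcript, for example toward the 3' end in degraded samples.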