Gene Expression Quantification
The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA data set. The module first translates the genomic mapping of each read (or read pair) to the corresponding transcript mappings. It then uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model and correct for GC bias in the reported quantification results.
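The EM step can be illustrated with a toy sketch. The following Python example is not DRAGEN's implementation: it ignores GC bias and fragment length modeling, and the function names (`em_quantify`, `to_tpm`) and inputs are hypothetical. It assumes each read has already been translated to the list of transcripts it is compatible with, and that each transcript has a known effective length.

```python
def em_quantify(compat, eff_len, n_iter=200):
    """Estimate expected read counts per transcript with a basic EM loop.

    compat  -- one list of compatible transcript indices per read (hypothetical input)
    eff_len -- effective length of each transcript
    """
    n_tx = len(eff_len)
    # Start from a uniform split of the reads across transcripts.
    counts = [len(compat) / n_tx] * n_tx
    for _ in range(n_iter):
        new_counts = [0.0] * n_tx
        for tx_ids in compat:
            # E-step: weight each compatible transcript by its current
            # abundance per base (count divided by effective length).
            weights = [counts[t] / eff_len[t] for t in tx_ids]
            total = sum(weights)
            if total == 0.0:
                continue
            for t, w in zip(tx_ids, weights):
                new_counts[t] += w / total
        # M-step: the summed fractional assignments become the new counts.
        counts = new_counts
    return counts


def to_tpm(counts, eff_len):
    """Convert expected counts to transcripts per million."""
    rates = [c / length for c, length in zip(counts, eff_len)]
    scale = 1e6 / sum(rates)
    return [r * scale for r in rates]


# Six reads unique to transcript 0, two unique to transcript 1, four ambiguous.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
eff_len = [1000.0, 1000.0]
counts = em_quantify(reads, eff_len)
tpm = to_tpm(counts, eff_len)
print(counts, tpm)
```

In this toy case the four ambiguous reads end up split 3:1, in proportion to the uniquely supported abundances, so the expected counts converge to about 9 and 3 reads.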
To enable the quantification module, set the --enable-rna-quantification option to true in your current RNA-seq command-line scripts. Additionally, you must provide a gene annotation file (GTF/GFF) that contains the genomic position of all transcripts to quantify. You can specify the GTF/GFF file using the -a or --annotation-file option.
Option | Description |
---|---|
--enable-rna-quantification | If set to true, enables RNA quantification. Requires --enable-rna to be set to true. |
--rna-library-type | Specifies the type of RNA-seq library. The following are the available values: |
--rna-quantification-gc-bias | GC bias correction estimates the effect of transcript %GC on sequencing coverage and accounts for the effect when estimating expression. To disable GC bias correction, set to false. |
--rna-quantification-fld-max, --rna-quantification-fld-mean, --rna-quantification-fld-sd | Use these options to specify the insert size distribution of the RNA-seq library for single-end runs. These options are relevant for GC bias correction. The defaults are 250 ± 25. The maximum allowed value is 1000. To improve accuracy, modify the values to match your library. |
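As a rough illustration, the options above could be assembled programmatically before being appended to an existing DRAGEN RNA command line. The helper below is hypothetical and only covers options named in this section. It assumes options are passed as `--option value` pairs and that the documented limit of 1000 applies to --rna-quantification-fld-max. All other DRAGEN options (input, reference, output) are deliberately omitted.

```python
def quantification_args(annotation_file, gc_bias=True,
                        fld_mean=None, fld_sd=None, fld_max=None):
    """Build the quantification-related options as a list of strings."""
    if fld_max is not None and fld_max > 1000:
        # Assumption: the documented maximum of 1000 refers to fld-max.
        raise ValueError("--rna-quantification-fld-max must not exceed 1000")
    args = [
        "--enable-rna", "true",
        "--enable-rna-quantification", "true",
        "--annotation-file", annotation_file,
    ]
    if not gc_bias:
        args += ["--rna-quantification-gc-bias", "false"]
    # Insert size settings are only relevant for single-end runs.
    for name, value in (("mean", fld_mean), ("sd", fld_sd), ("max", fld_max)):
        if value is not None:
            args += [f"--rna-quantification-fld-{name}", str(value)]
    return args


# Hypothetical single-end library with a 300 +/- 40 insert size.
print(" ".join(quantification_args("genes.gtf", fld_mean=300, fld_sd=40)))
```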
Transcript quantification results are reported in the <outputPrefix>.quant.sf text file. The file lists results for each transcript. You can use the output file as input for differential gene expression analysis with tools such as tximport and DESeq2.
The following is an example of the file contents:
Name Length EffectiveLength TPM NumReads
ENST00000364415.1 116 12.3238 5.2328 1
ENST00000564138.1 2775 2105.58 1.28293 41.8885
Field | Description |
---|---|
Name | The ID of the transcript. |
Length | The length of the (spliced) transcript in base pairs. |
EffectiveLength | The length as accessible to RNA-seq, accounting for insert-size and edge effects. |
TPM | Transcripts per Million (TPM) represents the expression of the transcript when normalized for transcript length and sequencing depth. |
NumReads | The estimated number of reads from the transcript. The values are not normalized. |
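The following is a small sketch of how the file might be read and sanity-checked in Python. It assumes whitespace-delimited columns as in the example above and the conventional TPM definition (reads per effective base, rescaled to sum to one million). Whether DRAGEN applies exactly this definition is an assumption, although the two example rows are consistent with it. The function names are hypothetical.

```python
import io

EXAMPLE = """Name\tLength\tEffectiveLength\tTPM\tNumReads
ENST00000364415.1\t116\t12.3238\t5.2328\t1
ENST00000564138.1\t2775\t2105.58\t1.28293\t41.8885
"""


def read_quant_sf(handle):
    """Parse a quant.sf-style table into a list of dicts with numeric fields."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        for key in ("Length", "EffectiveLength", "TPM", "NumReads"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


def reads_per_base(rows):
    """Length-normalized abundance: estimated reads per effective base."""
    return [r["NumReads"] / r["EffectiveLength"] for r in rows]


rows = read_quant_sf(io.StringIO(EXAMPLE))
rates = reads_per_base(rows)
# Under the assumed definition, TPM ratios equal reads-per-base ratios.
print(rates[0] / rates[1], rows[0]["TPM"] / rows[1]["TPM"])
```

For a real run, the `io.StringIO(EXAMPLE)` handle would be replaced by an open file handle for the <outputPrefix>.quant.sf file.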
The gene expression quantification module also outputs the following files. For information on the metrics included, refer to Quantification and RNA QC Metrics.
• <outputPrefix>.quant.genes.sf: Contains quantification results at the gene level. The results are produced by summing together all transcripts with the same geneID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
• <outputPrefix>.quant.metrics.csv: Summary statistics relevant to RNA transcripts and quantification. Refer to Quantification and RNA QC Metrics.
• <outputPrefix>.quant.transcript_fragment_lengths.txt: Full fragment length distribution of reads mapped to transcripts, output as length-probability pairs from the minimum length through >999 bases. Summing the products of the two columns yields the average fragment length.
• <outputPrefix>.quant.transcript_coverage.txt: Measures coverage uniformity with a normalized average of the 5' to 3' coverage pattern along transcripts in increments of 1%. The 100 coverage bins should sum to 100%.
• <outputPrefix>.SJ.saturation.txt: Measures sequencing saturation of the library, including the number of unique splice junctions observed as a function of reads processed.
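The gene-level aggregation and the fragment length average described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: "expression-weighted" is assumed to mean weighted by TPM, the transcript-to-gene mapping (`tx_to_gene`) is a hypothetical input that would normally come from the GTF, and file parsing is omitted.

```python
from collections import defaultdict


def summarize_genes(transcripts, tx_to_gene):
    """Sum TPM and NumReads per gene; average lengths weighted by TPM."""
    groups = defaultdict(list)
    for tx in transcripts:
        groups[tx_to_gene[tx["Name"]]].append(tx)
    genes = {}
    for gene_id, txs in groups.items():
        total_tpm = sum(t["TPM"] for t in txs)
        # Assumption: weights are TPM; use equal weights for unexpressed genes.
        weights = [t["TPM"] for t in txs] if total_tpm > 0 else [1.0] * len(txs)
        weight_sum = sum(weights)
        genes[gene_id] = {
            "Length": sum(w * t["Length"] for w, t in zip(weights, txs)) / weight_sum,
            "EffectiveLength": sum(w * t["EffectiveLength"] for w, t in zip(weights, txs)) / weight_sum,
            "TPM": total_tpm,
            "NumReads": sum(t["NumReads"] for t in txs),
        }
    return genes


def mean_fragment_length(pairs):
    """Average fragment length from (length, probability) pairs."""
    return sum(length * prob for length, prob in pairs)


# Hypothetical transcripts belonging to one gene.
example_tx = [
    {"Name": "tx1", "Length": 1000.0, "EffectiveLength": 800.0, "TPM": 30.0, "NumReads": 60.0},
    {"Name": "tx2", "Length": 2000.0, "EffectiveLength": 1800.0, "TPM": 10.0, "NumReads": 45.0},
]
genes = summarize_genes(example_tx, {"tx1": "G1", "tx2": "G1"})
print(genes["G1"])
print(mean_fragment_length([(100, 0.25), (200, 0.5), (300, 0.25)]))
```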
The RNA Quantification module outputs metrics related to the gene expression results and more general RNA QC metrics that rely on transcript-level analysis.
A summary of the metrics is output to the <outputPrefix>.quant.metrics.csv file.
Metric | Description |
---|---|
Library orientation | Library orientation of the RNA-Seq reads relative to the original transcripts. The library orientation can be automatically detected or explicitly provided. See Quantification Options for more information. |
Total Genes | Total number of genes from the gene annotation (GTF/GFF) input used for analysis. |
Coding Genes | Number of coding genes from the gene annotation (GTF/GFF), excluding pseudogenes and noncoding biotypes. |
Total Transcripts | Number of transcripts from the gene annotation file (GTF/GFF) input used for analysis. |
Median transcript CV coverage | Median coefficient of variation (CV), defined as the standard deviation of coverage divided by the mean coverage, of the 1000 most highly expressed transcripts. This metric measures uniformity of RNA-Seq read coverage. |
Median 5' coverage bias | Median 5' bias of the 1000 most highly expressed transcripts, calculated per transcript as the mean coverage of the 5'-most 100 bases divided by the mean coverage of the whole transcript. |
Median 3' coverage bias | Median 3' bias of the 1000 most highly expressed transcripts, calculated per transcript as the mean coverage of the 3'-most 100 bases divided by the mean coverage of the whole transcript. |
Transcript fragments | Number of fragments (read pairs) mapped to one or more annotated transcripts in the forward or reverse transcript sense. |
Strand mismatched fragments | Number of read pairs that do not match the expected strand of the transcript in the case of stranded library orientation. If this is reported, then Forward transcript fragments will not be reported. |
Forward transcript fragments | Number of read pairs that match transcripts on the forward strand in the case of unstranded library orientation. The percent column shows the fraction of forward fragments relative to Transcript fragments. The number of reverse fragments can be computed as Transcript fragments - Forward transcript fragments. |
Ambiguous strand fragments | Read pairs that match transcripts in both forward and reverse orientation. |
Intron fragments | Read pairs that overlap with a gene, but do not overlap with an exon. |
Intergenic fragments | Read pairs that do not overlap with any gene. |
Unknown transcript fragments | Read pairs that overlap with an exon of a gene, but do not match any transcript (mismatched splice sites). |
Number of genes with coverage > 1x,10x,30x,100x | The number of genes where the most highly expressed transcript has average coverage greater than 1x, 10x, 30x, and 100x. |
Fold coverage of all exons | The average sequencing coverage across all annotated exons, determined using the most highly expressed transcript for each gene. |
Fold coverage of coding exons | The average sequencing coverage across only exons within coding genes, determined using the most highly expressed transcript for each gene. |
Fold coverage of introns | The average sequencing coverage across detected introns. |
Fold coverage of intergenic regions | The average sequencing coverage across areas detected outside annotated genes. |
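The coverage uniformity metrics in the table (CV, 5' bias, and 3' bias) can be reproduced on per-base coverage vectors with a short sketch. The code below is illustrative only: it assumes each coverage list is ordered 5' to 3' in transcript orientation, that the caller has already selected the most highly expressed transcripts, and that the population standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, median, pstdev


def transcript_cv(cov):
    """Coefficient of variation: standard deviation divided by mean coverage."""
    return pstdev(cov) / mean(cov)


def five_prime_bias(cov, window=100):
    """Mean coverage of the 5'-most bases over the whole-transcript mean."""
    return mean(cov[:window]) / mean(cov)


def three_prime_bias(cov, window=100):
    """Mean coverage of the 3'-most bases over the whole-transcript mean."""
    return mean(cov[-window:]) / mean(cov)


def median_coverage_metrics(coverages):
    """Median CV and end biases over pre-selected transcripts."""
    # Skip transcripts without coverage to avoid dividing by zero.
    covered = [c for c in coverages if c and mean(c) > 0]
    return {
        "cv": median(transcript_cv(c) for c in covered),
        "bias_5p": median(five_prime_bias(c) for c in covered),
        "bias_3p": median(three_prime_bias(c) for c in covered),
    }


# Hypothetical 400-base transcript with elevated coverage at the 5' end.
cov = [2] * 100 + [1] * 300
metrics = median_coverage_metrics([cov])
print(metrics)
```

A 5' bias above 1 with a 3' bias below 1, as in this made-up vector, indicates coverage skewed toward the 5' end of the transcript.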