DRAGEN Secondary Analysis Output Files

This section describes each DRAGEN pipeline and its output files. In addition to generating files specific to each pipeline, DRAGEN provides metrics from the analysis in a <sample_name>.metrics.json file and the reports described in DRAGEN BCL Convert Pipeline. For more information on DRAGEN, refer to the DRAGEN Bio-IT Platform support site page.

All DRAGEN pipelines support the decompression of input BCL and compression of output BAM/CRAM files.

All DRAGEN pipelines also support the generation of FASTQ.ora files with DRAGEN Original Read Archive (ORA) compression. ORA compression reduces the size of FASTQ files by up to 5x. For more information, refer to the Illumina Support Site.

Output file considerations:

• For Germline, RNA, Enrichment, and DNA Amplicon pipelines running on-instrument analysis, BAM files will not be uploaded to BaseSpace Sequence Hub if Proactive, Run Monitoring and Storage is selected.

DRAGEN Enrichment Pipeline

The DRAGEN Enrichment pipeline supports the following features. If using DRAGEN 3.7 or later, both germline and somatic (tumor only) modes are supported.

• Sample demultiplexing
• Mapping and alignment, including sorting and duplicate marking
• Small variant calling
• Structural variant calling
• Copy number variant calling (version 3.10 or later)

To perform variant calling, a *.bed file must be included in the sample sheet or specified in Run Planning on BaseSpace Sequence Hub. Structural variant calls are generated only for paired-end reads in germline mode.

If using DRAGEN Enrichment version 3.8 or later, you can input a noise baseline file to improve performance in somatic mode. Refer to Import Noise Baseline Files.

If using Copy Number Variant (CNV) calling, a panel of normals must be supplied. Refer to Import Panel of Normals for CNV Calling.

The pipeline generates the following output files.

Component	Type	Output File Name
Mapping/aligning	BAM or CRAM	<sample_name>.bam or <sample_name>.cram
Small variant calling	VCF and gVCF*	<sample_name>.hard-filtered.gvcf.gz, <sample_name>.hard-filtered.vcf.gz
Structural variant calling	VCF	<sample_name>.sv.vcf.gz
Copy number variant calling	VCF	<sample_name>.cnv.vcf.gz

* gVCF output files are only available for germline mode.
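
The conditions above (gVCF only in germline mode, structural variants only for paired-end reads in germline mode, CNV output only when CNV calling is used) can be sketched as a small helper that lists the file names to expect for a sample. This is an illustrative sketch, not part of DRAGEN: the function name and all of its parameters are hypothetical, and it assumes variant calling has been configured with a *.bed file.

```python
def expected_enrichment_outputs(sample_name, mode="germline", paired_end=True,
                                cnv_enabled=False, use_cram=False):
    """List the Enrichment output file names implied by the table above.

    All parameter names are hypothetical; they are not DRAGEN options.
    Assumes a *.bed file was supplied so that variant calling runs.
    """
    if mode not in ("germline", "somatic"):
        raise ValueError("mode must be 'germline' or 'somatic'")

    # Mapping/aligning produces either a BAM or a CRAM file.
    files = [f"{sample_name}.cram" if use_cram else f"{sample_name}.bam"]

    # The hard-filtered VCF is listed for both modes.
    files.append(f"{sample_name}.hard-filtered.vcf.gz")

    if mode == "germline":
        # gVCF output is only available in germline mode.
        files.append(f"{sample_name}.hard-filtered.gvcf.gz")
        # Structural variants need paired-end reads and germline mode.
        if paired_end:
            files.append(f"{sample_name}.sv.vcf.gz")

    # CNV calling also needs a panel of normals (not modeled here).
    if cnv_enabled:
        files.append(f"{sample_name}.cnv.vcf.gz")
    return files


somatic_files = expected_enrichment_outputs("Sample1", mode="somatic")
```

A list like this can be compared against the contents of an analysis output folder to spot missing files.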

DRAGEN Germline Pipeline

The DRAGEN Germline pipeline supports the following features:

• Sample demultiplexing
• Mapping and alignment, including sorting and duplicate marking
• Small variant calling
• Structural variant calling for paired-end reads
• Copy number variant calling for human genomes
• Repeat expansions for human genomes
• Regions of homozygosity for human genomes
• [DRAGEN v3.8 or later] CYP2D6 detection

Structural variant calls are generated only for paired-end reads.

The pipeline generates the following output files.

Component	Type	Output File Name
Mapping/aligning	BAM or CRAM	<sample_name>.bam or <sample_name>.cram
Small variant calling	VCF and gVCF	<sample_name>.hard-filtered.gvcf.gz, <sample_name>.hard-filtered.vcf.gz
Structural variant caller	VCF	<sample_name>.sv.vcf.gz
Copy number variant caller	VCF	<sample_name>.cnv.vcf.gz
Repeat expansion	VCF	<sample_name>.repeats.vcf.gz
Regions of Homozygosity	CSV and BED	<sample_name>.roh_metrics.csv, <sample_name>.roh.bed
CYP2D6 Detection	TSV	<sample_name>.cyp2d6.tsv

DRAGEN DNA Amplicon Pipeline

The DRAGEN DNA Amplicon pipeline supports the following features:

• Sample demultiplexing
• Mapping and alignment, including sorting and duplicate marking
• Small variant calling in germline or somatic mode

To perform variant calling, a *.bed file must be included in the sample sheet or specified in the Run Planning on BaseSpace Sequence Hub.

The pipeline generates the following output files.

Component	Type	Output File Name
Mapping/aligning	BAM or CRAM	<sample_name>.bam or <sample_name>.cram
Small variant calling	VCF and gVCF*	<sample_name>.hard-filtered.gvcf.gz, <sample_name>.hard-filtered.vcf.gz

* gVCF output files are only available in germline mode.

DRAGEN RNA Pipeline

The DRAGEN RNA pipeline supports the following features:

• Sample demultiplexing
• Mapping and alignment, including sorting and duplicate marking
• Gene fusion detection
• Transcript quantification
• [DRAGEN v3.8 or later] Differential gene expression

To generate output files, specify a GTF file in the sample sheet or make sure the default genes.gtf.gz exists with the reference genome.

The pipeline generates the following output files.

Component	Type	Output File Name	Description
Mapping/aligning	BAM or CRAM	<sample_name>.bam or <sample_name>.cram	Alignment output meeting SAM specifications.
Gene fusion detection	Plain text	<sample_name>.fusion_candidates.preliminary	Fusion candidates before filters are applied.
Gene fusion detection	Plain text	<sample_name>.fusion_candidates.final	Fusion candidates after filters are applied.
Transcript quantification	Plain text	<sample_name>.quant.genes.sf	Transcript quantification results at the gene level.
Transcript quantification	Plain text	<sample_name>.quant.sf	All transcript quantification results.
Differential expression	CSV and PNG	Refer to the following differential expression output files table.	To generate output files, a comparison must be set up in the sample sheet.

The following files are output when differential expression is enabled.

File Name	Description
Control_vs_Comparison.differential_expression_metrics.csv	Contains differential expression analysis metrics.
Control_vs_Comparison.genes.counts.csv	Describes the number of reads mapped to each gene for each sample in the control and comparison groups.
Control_vs_Comparison.genes.heatmap.png	A heat map of the expression of the differentially expressed genes for samples in the control and comparison groups. The heat map only shows differentially expressed genes with an adjusted P-value < 0.05. If there are more than 30 differentially expressed genes, only the top 30 differentially expressed genes are used. If DESeq2 fails to converge or if there are no differentially expressed genes, the file is not generated.
Control_vs_Comparison.genes.ma.png	Contains the variation of gene expression ratios as a function of average signal intensity. To show the differences between measurements taken in two samples, the plot transforms the data onto M (log ratio) and A (mean average) scales, and then plots the values. The MA plot shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples. If the adjusted P-value is less than 0.1, the points are red. Points that fall out of the window are plotted as open triangles. Upwards pointing triangles represent a positive log fold change. Downwards pointing triangles represent a negative log fold change.
Control_vs_Comparison.genes.pca.png	Plot displays the first two principal components that explain the most variance.
Control_vs_Comparison.genes.res.csv	Contains DESeq2 results, which describe the mean expression, log2 (fold change), standard error of log2, P-value, adjusted P-value, and the expression status of each gene.
Control_vs_Comparison.genes.rlog.csv	Contains regularized log-transformed counts calculated by DESeq2.
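
The gene selection described for the heat map (an adjusted P-value cutoff of 0.05, then at most 30 genes) can be sketched as follows. This is an illustration only: the column names gene and padj are hypothetical and may not match the headers in Control_vs_Comparison.genes.res.csv, and ranking the "top" genes by smallest adjusted P-value is an assumption, since the ranking criterion is not stated here.

```python
import csv
import io


def select_heatmap_genes(rows, padj_cutoff=0.05, max_genes=30):
    """Pick genes the way the heat map description suggests.

    rows: iterable of dicts with hypothetical keys 'gene' and 'padj'.
    Ranking by smallest adjusted P-value is an assumption.
    """
    significant = []
    for row in rows:
        try:
            padj = float(row["padj"])
        except (TypeError, ValueError):
            continue  # skip missing values such as 'NA'
        if padj < padj_cutoff:
            significant.append((padj, row["gene"]))
    significant.sort()  # smallest adjusted P-value first
    return [gene for _, gene in significant[:max_genes]]


# Made-up example data, not real DRAGEN output.
example_csv = """gene,padj
GENE_A,0.001
GENE_B,0.2
GENE_C,NA
GENE_D,0.04
"""
rows = list(csv.DictReader(io.StringIO(example_csv)))
selected = select_heatmap_genes(rows)
```

An empty result corresponds to the case where no differentially expressed genes are found and the heat map file is not generated.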

DRAGEN Single Cell RNA Pipeline

The DRAGEN Single Cell RNA pipeline supports the following features:

• Sample demultiplexing
• Mapping and alignment, including sorting and duplicate marking
• Cell and gene classification

To generate output files, specify a GTF file in the sample sheet or make sure the default genes.gtf.gz exists with the reference genome.

The pipeline generates the following output files.

Component	Type	Output File Name
Mapping/aligning	BAM or CRAM	<sample_name>.bam or <sample_name>.cram
Cell/gene classification	TSV, CSV, and MTX	<sample_name>.scRNA.barcodeSummary.tsv, <sample_name>.scRNA.genes.tsv, <sample_name>.scRNA.matrix.mtx
Analysis reports	HTML	<sample_name>.dragen.scrna-report.*.html
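
As a rough illustration of how the <sample_name>.scRNA.matrix.mtx file might be inspected, the sketch below parses a sparse matrix in the Matrix Market coordinate layout using only the Python standard library. It assumes the file is uncompressed and follows that layout; the function name is hypothetical, and which axis holds genes and which holds cell barcodes is not specified here and should be checked against the accompanying TSV files.

```python
import io


def read_mtx_coordinate(handle):
    """Parse a Matrix Market coordinate file.

    Returns (n_rows, n_cols, entries), where entries maps zero-based
    (row, col) pairs to the stored value as a float.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket header")
    if "coordinate" not in header.split():
        raise ValueError("only the sparse coordinate layout is handled")

    # Skip comment lines (starting with %) up to the size line.
    size_line = None
    for line in handle:
        stripped = line.strip()
        if stripped and not stripped.startswith("%"):
            size_line = stripped
            break
    if size_line is None:
        raise ValueError("missing size line")
    n_rows, n_cols, n_entries = (int(x) for x in size_line.split())

    entries = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue
        # Matrix Market indices are one-based.
        row, col = int(parts[0]) - 1, int(parts[1]) - 1
        entries[(row, col)] = float(parts[2])
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


# Made-up example content, not real DRAGEN output.
example = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "% example comment\n"
    "3 2 2\n"
    "1 1 5\n"
    "3 2 7\n"
)
n_rows, n_cols, entries = read_mtx_coordinate(example)
```

For real data, an established reader such as the one in SciPy is likely a better choice; the manual parser only shows the file structure.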

DRAGEN BCL Convert Pipeline

The DRAGEN BCL Convert pipeline uses BCL data generated from your sequencing run and sample sheet information to output a FASTQ file for each sample. The FASTQ file name is <sample_name>.fastq.gz.

The pipeline generates the following reports.

Component	Type	Output File Name
Demultiplexing	CSV	Demultiplex_Stats.csv
Adapter metrics	CSV	Adapter_Metrics.csv
Index hopping	CSV	Index_Hopping_Counts.csv
Top unknown barcodes	CSV	Top_Unknown_Barcodes.csv

Demultiplexing Statistics Report

The demultiplexing statistics report contains information on the number of passing filter (PF) reads that are assigned to each sample in the sample sheet. Any reads not clearly associated with a sample are classified as undetermined. The report also includes information about the quality scores of bases in the PF reads assigned to each sample.

The following information is included.

Metric	Description
Lane	The flow cell lane in which the sample was sequenced.
SampleID	The sample ID from the sample sheet. If a read does not correspond with a sample, the field displays undetermined.
Index	The concatenation of Index Read 1 and Index Read 2 from the sample sheet separated by a hyphen. If a read does not correspond to a sample, the field displays undetermined.
# Reads	The number of PF reads demultiplexed for the sample in the specified lane.
# Perfect Index Reads	Number of reads with a perfect match to the combined index sequences specified in the sample sheet.
# One Mismatch Index Reads	Number of reads with one error in the combined index sequences specified in the sample sheet.
# of ≥ Q30 Bases (PF)	Number of bases, including adapters, corresponding to reads that pass a Q30 quality threshold.
Mean Quality Score (PF)	The mean quality score for reads corresponding to the sample in the specified lane. The value includes adapter bases.
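
As an example of using these metrics, the following sketch sums reads per sample across lanes and reports the fraction with a perfect index match. It is a minimal illustration that assumes the CSV header names match the metric names in the table above exactly; the real Demultiplex_Stats.csv headers may differ, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def perfect_index_fraction(handle):
    """Return {sample_id: fraction of reads with a perfect index match}.

    Assumes columns named 'SampleID', '# Reads', and
    '# Perfect Index Reads', as in the table above.
    """
    reads = defaultdict(int)
    perfect = defaultdict(int)
    for row in csv.DictReader(handle):
        sample = row["SampleID"]
        reads[sample] += int(row["# Reads"])
        perfect[sample] += int(row["# Perfect Index Reads"])
    # Guard against samples with zero reads.
    return {s: (perfect[s] / reads[s] if reads[s] else 0.0) for s in reads}


# Made-up example data, not real DRAGEN output.
example = io.StringIO(
    "Lane,SampleID,Index,# Reads,# Perfect Index Reads,# One Mismatch Index Reads\n"
    "1,S1,AAAA-CCCC,100,90,10\n"
    "2,S1,AAAA-CCCC,100,70,30\n"
    "1,S2,GGGG-TTTT,40,30,10\n"
)
fractions = perfect_index_fraction(example)
```

In a real run, pass an open file handle for Demultiplex_Stats.csv instead of the in-memory example.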

Adapter Metrics Report

The adapter metrics file contains the number of adapter and sample bases associated with each read.

The following information is included.

Metric	Description
Lane	The flow cell lane in which the sample was sequenced.
Sample_ID	The sample ID from the sample sheet. If a read does not correspond with a sample, the field displays undetermined.
index	The index1 sequence from the sample sheet. The field is empty if the index was not specified in the sample sheet or the sample ID value is undetermined.
index2	The index2 sequence from the sample sheet. The field is empty if index2 was not specified in the sample sheet or the sample ID value is undetermined.
R1_AdapterBases	Number of bases corresponding to AdapterRead1 in the sample sheet.
R1_SampleBases	Number of trimmed or masked bases from Read 1 for the corresponding lane and sample.
R2_AdapterBases	Number of bases corresponding to AdapterRead2 in the sample sheet.
R2_SampleBases	Number of trimmed or masked bases from Read 2 for the corresponding lane and sample.
# Reads	Number of reads for the sample in the specified lane.

Index Hopping Counts Report

The index hopping counts report contains the number of reads for each expected and hopped index for dual index runs. The report only includes unique dual indexes per lane where no barcode collision is detected in either index. To generate index-hopping metrics for a lane, every pair of entries within each index must have a Hamming distance of at least 2N + 1, where N is the barcode mismatch tolerance specified for the index.

For nonindex runs, single index runs, or lanes that do not contain unique dual indexes, the file contains only the headers.

The following information is included.

Metric	Description
Lane	The flow cell lane in which the sample was sequenced.
# Reads	Number of reads for the sample in the specified lane.
SampleID	The sample ID from the sample sheet. If a read does not correspond with a sample, the field displays undetermined.
index	The index1 sequence from the sample sheet. The field is empty if a read is single-ended or the sample ID value is undetermined.
index2	The index2 sequence from the sample sheet. The field is empty if a read is single-ended or the sample ID value is undetermined.
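
The pairwise distance condition for index-hopping metrics (every pair of entries within each index at least 2N + 1 apart in Hamming distance) can be checked with a short sketch like the one below. The function names and the per-index tolerance parameters n1 and n2 are hypothetical; this is not how DRAGEN itself performs the check.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def lane_supports_hopping_metrics(index1_list, index2_list, n1=1, n2=1):
    """True when every pair within each index is at least 2N + 1 apart.

    n1 and n2 are the mismatch tolerances for index 1 and index 2
    (hypothetical parameter names).
    """
    for seqs, n in ((index1_list, n1), (index2_list, n2)):
        for a, b in combinations(seqs, 2):
            if hamming(a, b) < 2 * n + 1:
                return False
    return True


# Made-up index sequences for one lane.
ok = lane_supports_hopping_metrics(["AAAA", "CCCC", "GGGG"],
                                   ["TTTT", "GGGG", "AAAA"])
too_close = lane_supports_hopping_metrics(["AAAA", "AAAC"],
                                          ["TTTT", "GGGG"])
```

With a tolerance of one mismatch, two indexes must differ in at least three positions, so a sequence with one error stays closer to its own index than to any other.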

Top Unknown Barcodes Report

The top unknown barcodes report contains the top 100 indexes or index pairs per lane that were not identified in the sample sheet according to the number of allowed mismatches. If multiple index values tie for the 100th highest index count, all index values with that count are output as the 100th entry.

The following information is included:

Metric	Description
Lane	The flow cell lane in which the sample was sequenced.
index	The sequence for each unknown index in index Read 1. The field is empty if no unknown indexes are found.
index2	The sequence for each unknown index in index Read 2. If the run was single-read or there were no unknown indexes found, the field is empty.
# Reads	Number of reads for the sample in the specified lane.
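
The tie handling at the cutoff can be illustrated with a small sketch that keeps every entry whose count equals the count of the last ranked entry. The function name and the limit parameter are hypothetical, and the example uses a limit of 2 rather than 100 to stay short.

```python
def top_with_ties(counts, limit=100):
    """Return (barcode, count) pairs for the highest counts.

    Keeps all entries tied with the entry at position `limit`,
    so the result can be longer than `limit`.
    """
    # Sort by descending count, then by barcode for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if len(ranked) <= limit:
        return ranked
    cutoff = ranked[limit - 1][1]
    return [item for item in ranked if item[1] >= cutoff]


# Made-up unknown barcode counts for one lane.
counts = {"AAAA": 50, "CCCC": 30, "GGGG": 30, "TTTT": 10}
top = top_with_ties(counts, limit=2)
```

Here both barcodes with a count of 30 are kept because they tie for the second position.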

Illumina DRAGEN QC Reports

For all pipelines, DRAGEN FastQC generates QC plots by default. Aggregated QC results are stored in the AggregatedFastqcMetrics folder and per-sample results are stored in the <sample_name> folder.

If the number of samples is greater than 512, QC reports are not generated.

The following QC plots are provided.

QC Plot	Description
adapter_content	The percentage of sequences for each base pair.
positional_mean_quality	Average Phred-scale base quality score for each read position.
gc_content	The GC content percentage for each sequencing read.
positional_quality.read_1	Average Phred-scale quality value of bases with a specific nucleotide and at a given location in Read 1.
gc_quality
positional_quality.read_2	Average Phred-scale quality value of bases with a specific nucleotide and at a given location in Read 2.
n_content
read_length	The sequence length for each read.
positional_base_content.read_1	Number of bases of each specific nucleotide at given locations in Read 1.
read_quality	Average Phred-scale quality score for each sequencing read.
positional_base_content.read_2	Number of bases of each specific nucleotide at given locations in Read 2.