Target Counts
The target counts stage is the first processing stage of the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments for use in downstream processing. The binning strategy, interval sizes, and interval boundaries are controlled by the target counts generation options and by the normalization technique used.
When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.
With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis.
The target counts stage generates a .target.counts.gz file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input option for the normalization stage. The .target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.
The .target.counts.gz file is a tab-delimited compressed text file with the following columns:
• Contig identifier
• Start position
• End position
• Target interval name
• Count of alignments in this interval
• Count of improperly paired alignments in this interval
An example of a *.target.counts.gz file is shown below.
contig start stop name SampleName improper_pairs
1 565480 565959 target-wgs-1-565480 7 6
1 566837 567182 target-wgs-1-566837 9 0
1 713984 714455 target-wgs-1-713984 34 4
1 721116 721593 target-wgs-1-721116 47 1
1 724219 724547 target-wgs-1-724219 24 21
1 725166 725544 target-wgs-1-725166 43 12
1 726381 726817 target-wgs-1-726381 47 14
1 753243 753655 target-wgs-1-753243 31 2
1 754322 754594 target-wgs-1-754322 27 0
1 754594 755052 target-wgs-1-754594 41 0
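As a read-only illustration of the column layout described above, the following minimal Python sketch parses a .target.counts.gz file. The names `TargetCount` and `read_target_counts` are hypothetical and are not part of DRAGEN. The sketch assumes that the first row is the column header shown in the example and, as a precaution, also skips any line that starts with `#`.

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class TargetCount(NamedTuple):
    """One row of a target counts file (hypothetical helper type)."""
    contig: str
    start: int
    stop: int
    name: str
    count: int           # alignments counted in this interval
    improper_pairs: int  # improperly paired alignments in this interval


def read_target_counts(path: str) -> Iterator[TargetCount]:
    """Yield one record per interval from a tab-delimited, gzip-compressed file."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Assumption: skip blank lines, '#' lines, and the column header row.
            if not row or row[0].startswith("#") or row[0] == "contig":
                continue
            contig, start, stop, name, count, improper = row[:6]
            yield TargetCount(contig, int(start), int(stop), name,
                              int(count), int(improper))
```

For example, `sum(r.count for r in read_target_counts(path))` gives the total number of counted alignments across all intervals. The file is only read here, in keeping with the guidance that it should not be modified.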
If the samples are whole genome samples, the effective target interval width is specified with the --cnv-interval-width option. Higher sample coverage supports a smaller interval width, and therefore a finer resolution. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the effective width might be larger than the specified value.
The default value for WGS is 1000 bp, which is recommended for a sample coverage of ≥ 30x.
WGS Coverage per Sample | Recommended Resolution* (bp) |
---|---|
5 | 10000 |
10 | 5000 |
≥ 30 | 1000 |
*Using a --cnv-interval-width value of ≤ 250 bp for WGS analysis can drastically increase run time.
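The coverage recommendations can be expressed as a small lookup. This is a minimal sketch, not DRAGEN behavior: the function name `recommended_interval_width` is hypothetical, and treating each tabulated coverage as a lower bound (so that, for instance, 12x maps to the 10x row) is an assumption, because the table lists only three coverage points.

```python
def recommended_interval_width(coverage: float) -> int:
    """Return a suggested --cnv-interval-width value (bp) for a WGS sample."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    # Table rows, highest coverage first: (minimum coverage, width in bp).
    table = [(30, 1000), (10, 5000), (5, 10000)]
    for min_coverage, width in table:
        if coverage >= min_coverage:
            return width
    # Assumption: below the lowest tabulated coverage, use the coarsest width.
    return 10000
```

When running with a panel of normals, pick one width for the whole cohort rather than per sample, because all samples must have matching intervals.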
The intervals are autogenerated for every primary contig in the reference. DRAGEN supports only references that follow the UCSC or GRC naming convention, for example, chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. To specify a list of contigs to skip, use the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosome is skipped. Nonprimary contigs are never processed.
For example, to skip chromosomes M, X, and Y, use the following option:
--cnv-skip-contig-list "chrM,chrX,chrY"
If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED option.
The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into smaller intervals, depending on the value of --cnv-interval-width.
To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
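The following minimal sketch illustrates how BED intervals can be divided into smaller pieces. It reads headerless BED lines, ignores every column after the third, and cuts each interval into consecutive pieces of at most `width` bp. The function names are hypothetical, and the fixed-width rule with a shorter final piece is an assumption: the exact splitting rule that DRAGEN applies is not described here.

```python
from typing import Iterable, Iterator, Tuple


def split_interval(start: int, end: int, width: int) -> Iterator[Tuple[int, int]]:
    """Split the half-open interval [start, end) into pieces of at most `width` bp."""
    if width <= 0:
        raise ValueError("width must be positive")
    for piece_start in range(start, end, width):
        # The last piece is shorter when the length is not a multiple of width.
        yield piece_start, min(piece_start + width, end)


def read_bed_targets(lines: Iterable[str],
                     width: int = 500) -> Iterator[Tuple[str, int, int]]:
    """Parse headerless BED lines, using only the first three columns."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, start, end = fields[0], int(fields[1]), int(fields[2])
        for piece_start, piece_end in split_interval(start, end, width):
            yield contig, piece_start, piece_end
```

The default of 500 mirrors the documented WES default for --cnv-interval-width. BED coordinates are zero-based and half-open, which is why the pieces share boundaries without overlapping.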
The following options control the generation of target counts.
Option | Description |
---|---|
--cnv-counts-method | Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start. |
--cnv-min-mapq | Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted. |
--cnv-target-bed | Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis. |
--cnv-interval-width | Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis. |
--cnv-skip-contig-list | Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm. |
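To make the three --cnv-counts-method values concrete, the following minimal sketch tests whether one alignment is counted for one bin. Only the overlap rule is described in the table above. The start and midpoint rules are inferred from their names, and the half-open coordinates and the function name `counts_in_bin` are assumptions for illustration.

```python
def counts_in_bin(aln_start: int, aln_end: int,
                  bin_start: int, bin_end: int,
                  method: str = "overlap") -> bool:
    """Return True if alignment [aln_start, aln_end) counts for bin [bin_start, bin_end)."""
    if method == "overlap":
        # Counted if any part of the alignment overlaps the bin.
        return aln_start < bin_end and aln_end > bin_start
    if method == "start":
        # Assumption: counted only in the bin that contains the start position.
        return bin_start <= aln_start < bin_end
    if method == "midpoint":
        # Assumption: counted only in the bin that contains the midpoint.
        midpoint = (aln_start + aln_end) // 2
        return bin_start <= midpoint < bin_end
    raise ValueError(f"unknown counting method: {method}")
```

Under these assumptions, an alignment that straddles a bin boundary is counted in both bins with overlap but in exactly one bin with start or midpoint.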
In the WGS case, where a BED file is not specified, the same intervals are generated each time for a given reference. The intervals account for the mappability of the reference genome by using a k-mer uniqueness map created during hashtable generation. A dropout region is a complex region in which alignments are not counted, which results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span a dropout region, even though alignment counting does not occur in that region. Such events are handled by the segmentation stage.