Downsampling Options for Small Variant Calling

You can use the following options for downsampling reads in the small variant calling pipeline.

| Option | Description |
| --- | --- |
| --vc-target-coverage | Specifies the maximum number of reads with a start position overlapping any given position. |
| --vc-max-reads-per-active-region | Specifies the maximum number of reads covering a given active region. |
| --vc-max-reads-per-raw-region | Specifies the maximum number of reads covering a given raw region. |
| --vc-min-reads-per-start-pos | Specifies the minimum number of reads with a start position overlapping any given position. |
| --high-coverage-support-mode | Applies the high coverage mode down-sample options if set to true. Enabling this option is recommended for targeted panels with coverage over 1000x, but will slow down run time. |

For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.

--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito

The target coverage option and the max/min reads options for raw and active regions are not directly related and can be triggered independently.

The target coverage option runs first and limits the number of reads that share the same start position. It is not a limit on the total coverage at a given position.

The following are the default downsampling values for each small variant calling mode.

| Mode | Downsampling Option | Default Value |
| --- | --- | --- |
| Germline | --vc-target-coverage | 500 |
| Germline | --vc-max-reads-per-active-region | 10000 |
| Germline | --vc-max-reads-per-raw-region | 30000 |
| Somatic | --vc-target-coverage | 50 |
| Somatic | --vc-max-reads-per-active-region | 10000 |
| Somatic | --vc-max-reads-per-raw-region | 30000 |
| High Coverage | --vc-target-coverage | 100000 |
| High Coverage | --vc-max-reads-per-active-region | 200000 |
| High Coverage | --vc-max-reads-per-raw-region | 200000 |
| Mitochondrial | --vc-target-coverage-mito | 40000 |
| Mitochondrial | --vc-max-reads-per-active-region-mito | 200000 |
| Mitochondrial | --vc-max-reads-per-raw-region-mito | 200000 |

The DP reported in a variant record can exceed the --vc-target-coverage value. For example, assume --vc-target-coverage is set to 500, the germline default. If 400 reads start at position 1, another 400 start at position 2, and another 400 start at position 3, the target coverage option is not triggered, because each start position has fewer than 500 reads (400 < 500). If there is a variant at position 4, the reported depth of the variant could be as high as 1200.
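
The arithmetic in this example can be checked with a short sketch. It is illustrative only and is not the pipeline's implementation: the function names, the dictionary of read counts per start position, and the fixed read length of 150 are all assumptions made for the example.

```python
def cap_per_start(start_counts, target_coverage):
    """Cap the number of reads kept at each start position (hypothetical helper)."""
    return {pos: min(n, target_coverage) for pos, n in start_counts.items()}


def depth_at(start_counts, position, read_length):
    """Count reads overlapping `position`, assuming every read has the same length."""
    return sum(
        n
        for start, n in start_counts.items()
        if start <= position < start + read_length
    )


# 400 reads start at each of positions 1, 2, and 3.
start_counts = {1: 400, 2: 400, 3: 400}

# No start position exceeds 500, so the cap removes nothing.
kept = cap_per_start(start_counts, target_coverage=500)

# All 1200 reads still overlap position 4 (read length is an assumed value).
dp = depth_at(kept, position=4, read_length=150)
print(dp)  # 1200
```

The cap applies to each start position separately, so the depth at a downstream position is the sum over all start positions whose reads reach it.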

After the target coverage step, the maximum number of reads that share the same start position is 500 (if --vc-target-coverage is set to 500).

The next downsampling step applies the --vc-max-reads-per-raw-region and --vc-max-reads-per-active-region limits. In this step, the number of reads that share the same start position can be reduced further, below the maximum of 500 left by the first step. These options limit the total number of reads in an entire region using a leveling downsampling method.

The downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can take several passes across the entire region for the total number of reads in the region to fall below the threshold. After the threshold is met, the downsampling step stops, regardless of which position in the region was considered last.

If the number of reads sharing a start position is equal to or lower than the --vc-min-reads-per-start-pos value, that position is skipped. This makes sure that downsampling never reduces a start position below the minimum number of reads set by --vc-min-reads-per-start-pos.
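
The leveling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function name and the count-per-start-position representation are hypothetical, the stop condition is assumed to be "total at or below the limit", and the sketch only tracks how many reads remain at each start position rather than which reads are removed.

```python
def level_downsample(start_counts, max_reads, min_reads_per_start):
    """Remove one read per start position per pass until the region fits the limit.

    start_counts maps start position -> number of reads starting there.
    Returns a new dict with the remaining counts.
    """
    counts = dict(start_counts)
    total = sum(counts.values())
    positions = sorted(counts)  # scan from the start boundary of the region

    while total > max_reads:
        removed_in_pass = 0
        for pos in positions:
            if total <= max_reads:
                break  # threshold met: stop wherever the scan currently is
            if counts[pos] <= min_reads_per_start:
                continue  # skip positions already at or below the minimum
            counts[pos] -= 1
            total -= 1
            removed_in_pass += 1
        if removed_in_pass == 0:
            break  # every position is at the minimum; nothing more to remove

    return counts


region = {100: 6, 101: 2, 102: 5}
result = level_downsample(region, max_reads=9, min_reads_per_start=2)
print(result)  # {100: 4, 101: 2, 102: 3}
```

In this toy region, two passes are needed. Position 101 is skipped in both passes because it already holds only the minimum number of reads.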

When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded with a default value so that it produces the same sequence of values in each run. This ensures reproducible results: there is no run-to-run variation when using the same input data.
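
The effect of a fixed seed can be illustrated with Python's standard `random` module. This sketch only demonstrates the general principle: the seed value of 42, the function name, and the read names are assumptions for the example and do not reflect the generator or seed that the pipeline uses.

```python
import random


def pick_reads(read_names, keep, seed=42):
    """Randomly choose `keep` reads using a generator seeded with a fixed value."""
    rng = random.Random(seed)  # a fresh, identically seeded generator on every call
    return sorted(rng.sample(read_names, keep))


reads = [f"read_{i}" for i in range(10)]

first_run = pick_reads(reads, keep=4)
second_run = pick_reads(reads, keep=4)

# Same input and same seed give the same selection in every run.
print(first_run == second_run)  # True
```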