Small Variant Filtering
The software component, Pepe, performs post‑processing on the small variant calling genome VCFs to polish backgrounds and adjust quality scores. The software filters out variants when error rates do not meet quality thresholds. This analysis step produces genome VCF files and associated error rate files. The minimum read depth for reference calls is
Pepe computes two quality scores for each candidate that dynamically adjust for the following conditions:
• Background noise
• Trinucleotide change
• Read support type
ALT | FILTER | Note |
---|---|---|
. | PASS | WT |
., A, C, G, etc¹ | LowDP² | Reference or filtered variant candidate with depth <1000. |
A, C, G, etc¹ | PASS | PASS variants. |
A, C, G, etc¹ | LowSupport, LowVarSupport | Filtered variant candidate with low-quality score. |
A, C, G, etc¹ | Blacklist | Position with high background noise. Not available for variant detection. |
A, C, G, etc¹ | VarBias | Filtered variant candidate showed bias clustered at fragment ends. |
¹ Etc refers to other variant types not mentioned in the table.
² For reference positions, a coverage below 1000X directly translates into LowDP, but if a position has a PASS filter for variant calls, LowDP is not applied. This depends on the LQ/AQ thresholds for COSMIC or non-COSMIC variants, and allele frequencies.
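The FILTER values in the table above can be summarized per sample with a short script. The following is a minimal sketch, not part of the Pepe software. It assumes a tab-delimited VCF body with FILTER in the seventh column and semicolon-separated filter labels, as the VCF specification allows. The function name `tally_filters` and the example records are hypothetical.

```python
from collections import Counter


def tally_filters(vcf_lines):
    """Count how often each FILTER label appears in VCF data lines.

    Header lines (starting with '#') and blank lines are skipped.
    A record carrying several semicolon-separated labels contributes
    one count to each label.
    """
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed record: no FILTER column
        for label in fields[6].split(";"):
            counts[label] += 1
    return counts


# Hypothetical records for illustration only (not real output).
example_records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\t.\t.\tPASS\t.",
    "chr1\t101\t.\tC\tT\t.\tLowSupport\t.",
    "chr1\t102\t.\tG\t.\t.\tLowDP\t.",
    "chr1\t103\t.\tG\tA\t.\tBlacklist\t.",
    "chr1\t104\t.\tT\tC\t.\tPASS\t.",
]

filter_counts = tally_filters(example_records)
for label, n in sorted(filter_counts.items()):
    print(label, n)
```

To apply the same tally to a file on disk, pass an open text file handle in place of the list.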
The net effect of the read collapsing and TMB analysis steps reduces false positives in a typical cell‑free DNA sample from ~1500 per Mb to < 5 per Mb.
For more information on DNA alignment and read collapsing, refer to Illumina DRAGEN Bio-IT Platform Product Documentation.
For each variant candidate, background noise at the same site is estimated from normal baseline samples of varying quality. A p-value is calculated from the observed mutant depth, total depth, and background noise using a binomial distribution, and is then converted to a variant quality score (AQ). The sample-specific error rate of each trinucleotide change is estimated from the different support categories in each sample, using all positions with an allele frequency less than 1%. For each variant candidate, a likelihood ratio score (LQ) is computed from the corresponding error rate and the observed total and mutant read counts. A bias score (BFQ) is also computed for each variant candidate to evaluate the imbalance of mutant versus total read support between the different support groups.
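The AQ calculation can be illustrated with a minimal sketch. It assumes that the p-value is the upper-tail binomial probability of observing at least the given mutant depth under the background noise rate, and that the conversion to a quality score is a Phred-style -10·log10(p). The actual conversion used by Pepe is not specified here, and the function names, the score cap, and the example values are hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    log_n_factorial = math.lgamma(n + 1)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            log_n_factorial
            - math.lgamma(i + 1)
            - math.lgamma(n - i + 1)
            + i * log_p
            + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def aq_score(mutant_depth, total_depth, background_rate, cap=100.0):
    """Phred-style score from the binomial p-value (assumed conversion)."""
    p_value = binomial_upper_tail(mutant_depth, total_depth, background_rate)
    if p_value <= 0.0:
        return cap  # p-value underflowed; report the (hypothetical) cap
    return max(0.0, min(cap, -10.0 * math.log10(p_value)))


# Hypothetical site: depth 2000, background noise rate 0.05%.
weak = aq_score(mutant_depth=3, total_depth=2000, background_rate=0.0005)
strong = aq_score(mutant_depth=10, total_depth=2000, background_rate=0.0005)
print(f"3 mutant reads:  AQ = {weak:.1f}")
print(f"10 mutant reads: AQ = {strong:.1f}")
```

Under these assumptions, a candidate with more mutant reads than the background rate can plausibly explain receives a higher AQ, while a candidate whose mutant depth is consistent with background noise scores low.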
For variants with a Catalogue of Somatic Mutations in Cancer (COSMIC) count > 50, the LQ and AQ thresholds are 20; for the remaining sites, the thresholds are 60. For indels, at least one stitched mutant support is required. For non-COSMIC variants, the BFQ threshold is < 20. In addition, the positions of the mutant and WT alleles within the fragment are extracted for each variant candidate, and a Kolmogorov-Smirnov test is applied to compute a p-value comparing the mutant and WT positions. Variants with a p-value < 0.05 and a median difference ≥ 0.5 are filtered and labeled VarBias. The net effect of read collapsing and variant filtering is a significant reduction in false positives. For example, false positives in a typical cell-free DNA sample were reduced from ~1500 per Mb to < 5 per Mb.
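The thresholds described above can be combined into a single decision, sketched below. This is an illustration under stated assumptions, not the Pepe implementation. It assumes that a score passes when it is greater than or equal to its threshold, that "non-COSMIC" means a COSMIC count of 50 or less, that every score failure is reported as LowSupport (the mapping to LowSupport versus LowVarSupport is not specified here), and that the checks run in the order shown. The `Candidate` class and `assign_filter` function are hypothetical, and the LowDP and Blacklist filters are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    aq: float                          # variant quality score
    lq: float                          # likelihood ratio score
    bfq: float                         # bias score
    cosmic_count: int                  # COSMIC count for the variant
    is_indel: bool = False
    stitched_mutant_support: int = 0   # stitched reads supporting the mutant
    ks_p_value: float = 1.0            # KS test p-value, mutant vs WT position
    median_position_diff: float = 0.0  # median difference, mutant vs WT position


def assign_filter(c: Candidate) -> str:
    """Return a FILTER label for a candidate (assumed order and labels)."""
    is_cosmic = c.cosmic_count > 50
    score_threshold = 20 if is_cosmic else 60

    # AQ and LQ must both reach the threshold (>= is an assumption).
    if c.aq < score_threshold or c.lq < score_threshold:
        return "LowSupport"
    # Indels need at least one stitched mutant support.
    if c.is_indel and c.stitched_mutant_support < 1:
        return "LowSupport"
    # Non-COSMIC variants must have BFQ below 20.
    if not is_cosmic and c.bfq >= 20:
        return "LowSupport"
    # Mutant reads clustered at different fragment positions than WT reads.
    if c.ks_p_value < 0.05 and c.median_position_diff >= 0.5:
        return "VarBias"
    return "PASS"


# Hypothetical candidates for illustration only.
hotspot = Candidate(aq=25, lq=25, bfq=0, cosmic_count=100)
novel = Candidate(aq=25, lq=25, bfq=0, cosmic_count=0)
print(assign_filter(hotspot))
print(assign_filter(novel))
```

The same AQ and LQ values give different outcomes for the two candidates because the site with a high COSMIC count is held to the threshold of 20 and the other site to the threshold of 60.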
In addition to the evaluation of the quality scores, certain regions covered in the product manifest are excluded from analysis due to high background noise. All excluded variants are identified in the VCF using a flag. The block list of excluded sites can be obtained on request from your local Illumina representative.
Some regions are known to be difficult to sequence. One example is the TERT promoter region. Although this region can be sequenced, it might show low coverage due to its GC-rich content.