CNV VCF File
The CNV VCF file follows the standard VCF format. Because CNV events are represented differently from structural variants, not all fields are applicable. In general, when more information is available about an event, it is annotated in the record. Some fields in the DRAGEN CNV VCF are unique to CNVs.
The following is an example of the header lines that are specific to CNV.
##fileformat=VCFv4.2
##CoverageUniformity=0.402517
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
…
##reference=file:///reference_genomes/Hsapiens/hs37d5/DRAGEN
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=REFLEN,Number=1,Type=Integer,Description="Number of REF positions included in this record">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">
##FILTER=<ID=cnvQual,Description="CNV with quality below 10">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">
##FORMAT=<ID=BC,Number=1,Type=Integer,Description="Number of bins in the region">
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Number of improperly paired end reads at start and stop breakpoints">
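Structured header lines such as the ##INFO, ##FILTER, and ##FORMAT entries above can be read into key/value pairs with a few lines of Python. The following is a minimal sketch, not part of DRAGEN: the function name `parse_meta_line` is hypothetical, and the sketch assumes that quoted descriptions do not contain escaped quotation marks.

```python
import re

# Matches lines of the form ##KIND=<key=value,key="quoted value",...>
_META = re.compile(r'^##(\w+)=<(.*)>$')
# A value is either a double-quoted string or a run of non-comma characters.
_PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return (kind, fields) for a structured header line, else None."""
    match = _META.match(line.strip())
    if match is None:
        return None  # for example ##fileformat=VCFv4.2, which has no <...> body
    kind, body = match.groups()
    fields = {key: value.strip('"') for key, value in _PAIR.findall(body)}
    return kind, fields


kind, fields = parse_meta_line(
    '##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">'
)
```

Here `kind` is the header category (FORMAT) and `fields` maps ID, Number, Type, and Description to their string values.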
The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. All coordinates in the VCF are 1-based.
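The padding-base convention can be checked with simple arithmetic. The sketch below assumes that END is the 1-based, inclusive coordinate of the last affected base. The helper names and the example coordinates are hypothetical, and the sketch does not handle an event that begins at the first base of a contig.

```python
def symbolic_pos(first_affected_base: int) -> int:
    """POS for a symbolic allele: the 1-based base just before the event."""
    if first_affected_base < 2:
        raise ValueError("no preceding base is available for padding")
    return first_affected_base - 1


def affected_span(pos: int, end: int) -> int:
    """Number of reference bases from POS+1 through END, inclusive."""
    return end - pos


# Hypothetical event covering bases 1001 through 2000.
pos = symbolic_pos(1001)
span = affected_span(pos, 2000)
```

With these inputs, POS is 1000 and the event covers 1000 reference bases.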
The ID column identifies the event. It encodes the event type and the coordinates of the event.
The REF column contains an N for all CNV events.
The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the DEL or DUP entry is used.
The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering.
The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter.
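The two filters declared in the header can be sketched as a small function. This is an illustration only, under several assumptions: the thresholds are taken from the header descriptions (quality below 10, copy ratio within 0.2 of 1.0), the copy ratio compared is the SM value, the boundary is treated as exclusive, and multiple failed filters are joined with semicolons as in the VCF specification. The function name is hypothetical.

```python
def cnv_filter_status(qual, copy_ratio, min_qual=10.0, ratio_margin=0.2):
    """Return the FILTER string for one CNV record.

    Assumptions: thresholds come from the header descriptions, and
    values exactly on a threshold are not filtered.
    """
    failed = []
    if qual < min_qual:
        failed.append("cnvQual")
    if abs(copy_ratio - 1.0) < ratio_margin:
        failed.append("cnvCopyRatio")
    # VCF convention: PASS when nothing failed, else ';'-separated names.
    return ";".join(failed) if failed else "PASS"
```

For example, `cnv_filter_status(50, 1.5)` yields PASS, while `cnv_filter_status(5, 1.5)` yields cnvQual.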
The INFO column contains information representing the event. The REFLEN entry indicates the length of the event. The SVTYPE entry is always CNV. The END entry indicates the end position of the event. If a segment BED file is used, the segment identifier is carried over from the input to the SEGID field.
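The INFO column is a semicolon-separated list of key=value entries, so it can be split into a dictionary. The following is a minimal sketch: the function name `parse_info` is hypothetical, the integer types follow the header declarations shown earlier, and the example string is invented for illustration rather than taken from real output.

```python
# Types follow the ##INFO header lines: single integers and integer lists.
INT_FIELDS = {"REFLEN", "END"}
INT_LIST_FIELDS = {"SVLEN", "CIPOS", "CIEND"}


def parse_info(info):
    """Split a VCF INFO string into a dict, converting known integer fields."""
    result = {}
    if info in ("", "."):
        return result  # empty or missing INFO column
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            result[key] = True  # flag entry without a value
        elif key in INT_FIELDS:
            result[key] = int(value)
        elif key in INT_LIST_FIELDS:
            result[key] = [int(v) for v in value.split(",")]
        else:
            result[key] = value  # for example SVTYPE or SEGID, kept as text
    return result


# Hypothetical INFO string for illustration.
info = parse_info("SVTYPE=CNV;END=2000;REFLEN=1000;SEGID=seg1")
```

Fields that the sketch does not know about are kept as plain strings, so an unexpected entry does not cause a failure.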
The FORMAT fields are described in the header.
• GT: Genotype
• SM: Linear copy ratio of the segment mean
• CN: Estimated copy number
• BC: Number of bins in the region
• PE: Number of improperly paired end reads at start and stop breakpoints
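The FORMAT column names the per-sample fields, and the sample column holds their colon-separated values in the same order. A minimal sketch of pairing the two follows. The function name `parse_sample` and the example values are hypothetical; the type conversions follow the ##FORMAT header declarations.

```python
def _int_pair(value):
    """Convert 'a,b' into a tuple of ints, keeping '.' entries as None."""
    return tuple(None if v == "." else int(v) for v in value.split(","))


# Converters follow the ##FORMAT header lines.
CONVERTERS = {"GT": str, "SM": float, "CN": int, "BC": int, "PE": _int_pair}


def parse_sample(format_col, sample_col):
    """Pair FORMAT keys with one sample's values and convert known types."""
    parsed = {}
    for key, value in zip(format_col.split(":"), sample_col.split(":")):
        if value == ".":
            parsed[key] = None  # whole field is missing
        else:
            parsed[key] = CONVERTERS.get(key, str)(value)
    return parsed


# Hypothetical sample column for illustration.
sample = parse_sample("GT:SM:CN:BC:PE", "0/1:0.5:1:12:3,4")
```

A GT value such as ./. is kept as text, because only a field that is exactly `.` is treated as missing.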
Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype (GT) field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:
Diploid or Haploid? | ALT | FORMAT:CN | FORMAT:GT |
---|---|---|---|
Diploid | . | 2 | ./. |
Diploid | <DUP> | > 2 | ./1 |
Diploid | <DEL> | 1 | 0/1 |
Diploid | <DEL> | 0 | 1/1 |
Haploid | . | 1 | 0 |
Haploid | <DUP> | > 1 | 1 |
Haploid | <DEL> | 0 | 1 |
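The mapping in the table above can be written as a small lookup function. This is a sketch that mirrors the table rows only. The function name `expected_gt` is hypothetical, and the function is not a description of how DRAGEN computes the field internally.

```python
def expected_gt(ploidy, cn):
    """Return the GT string implied by the table for a region's ploidy and CN."""
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    if ploidy == 2:
        if cn == 0:
            return "1/1"   # both copies deleted
        if cn == 1:
            return "0/1"   # one copy deleted
        if cn == 2:
            return "./."   # reference copy number, haplotypes unknown
        return "./1"       # duplication, haplotype split unknown
    if ploidy == 1:
        if cn == 1:
            return "0"     # reference copy number
        return "1"         # deletion (CN 0) or duplication (CN > 1)
    raise ValueError("only haploid (1) and diploid (2) regions are covered")
```

For example, `expected_gt(2, 3)` gives ./1 and `expected_gt(1, 0)` gives 1, matching the corresponding table rows.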