VCF Files

Variant call format (*.vcf) files contain information about variants found at specific positions in a reference genome.

The VCF file header includes the VCF file format version and the variant caller version and lists the annotations used in the remainder of the file. The VCF header also includes the reference genome file and BAM file. The last line in the header contains the column headings for the data lines. Each of the VCF file data lines contains information about a single variant.
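The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the parser used by any workflow. It assumes the VCF specification's convention that meta-information lines start with `##` and the column-heading line starts with a single `#`. The function name `split_vcf` and the example lines, including the caller name, are hypothetical.

```python
def split_vcf(lines):
    """Separate meta-information lines, column headings, and data lines."""
    meta, columns, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Header line such as the file format version or caller version.
            meta.append(line[2:])
        elif line.startswith("#"):
            # Last header line: tab-separated column headings.
            columns = line[1:].split("\t")
        else:
            # Data line: one variant, keyed by column heading.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Hypothetical file content, made up for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "##source=hypothetical_caller_v1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "chr1\t12345\t.\tA\tT\t50.0\tPASS\tDP=30\tGT:GQ\t0/1:48",
]
meta, columns, records = split_vcf(example)
```

Each entry in `records` is a dictionary keyed by the column headings, so `records[0]["POS"]` holds the position string of the first variant.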

VCF File Headings

Heading

Description

CHROM

The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.

POS

The single-base position of the variant in the reference chromosome. For single nucleotide variants (SNVs), this position is the reference base with the variant. For indels, this position is the reference base immediately preceding the variant.

ID

The rs (reference SNP) number for the SNP obtained from dbSNP.txt, if applicable. If multiple rs numbers exist at this location, the list is delimited by semicolons. If a dbSNP entry does not exist at this position, a missing value marker ('.') is used.

REF

The reference allele. For example, a deletion of a single T is represented as reference TT and alternate T. An A to T single nucleotide variant is represented as reference A and alternate T.

ALT

The alleles that differ from the reference. For example, an insertion of a single T is represented as reference A and alternate AT. An A to T single nucleotide variant is represented as reference A and alternate T.
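The REF and ALT examples can be turned into a rough classifier. The sketch below is a simplification that only covers the cases named in the descriptions (SNV, insertion, deletion). The function name `classify` is hypothetical, and the handling of several comma-separated ALT alleles follows the VCF specification rather than anything stated here.

```python
def classify(ref, alt):
    """Classify one REF/ALT pair as SNV, insertion, deletion, or other."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref) and alt.startswith(ref):
        # ALT extends REF, e.g. REF=A, ALT=AT.
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        # REF extends ALT, e.g. REF=TT, ALT=T.
        return "deletion"
    return "other"


def classify_all(ref, alt_field):
    """Classify every allele in a comma-separated ALT field."""
    return [classify(ref, alt) for alt in alt_field.split(",")]
```

For example, `classify("TT", "T")` reports a deletion and `classify("A", "AT")` reports an insertion, matching the two indel examples in the descriptions.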

QUAL

A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and a lower probability of error. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has an estimated error rate of 10^(-3), or 0.1%. Many variant callers assign quality scores based on their own statistical models, and these scores are often high relative to the error rate actually observed.
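The relationship between a Phred-scaled score and the estimated error probability can be checked with a short calculation. The function names below are hypothetical. The formula is the one given in the QUAL description.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability that a call with quality score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse conversion: Phred-scaled score for an error probability p."""
    return -10 * math.log10(p)
```

With these definitions, a score of 30 maps to an error probability of about 0.001 (0.1%), and a score of 20 maps to about 0.01 (1%).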

VCF File Annotations

Heading

Description

FILTER

If all filters are passed, PASS is written in the filter column.

Germline workflow possible FILTER entries include:

DRAGENSnpHardQUAL—Applied if SNP variant QUAL score does not meet threshold
DRAGENIndelHardQUAL—Applied if indel variant QUAL score does not meet threshold
LowDepth—Site filtered because depth of coverage does not meet threshold
LowGQ—Site filtered because genotype quality does not meet threshold
PloidyConflict—Genotype call from variant caller not consistent with chromosome ploidy
base_quality—Site filtered because median base quality of alt reads at this locus does not meet threshold
filtered_reads—Site filtered because too large a fraction of reads has been filtered out
fragment_length—Site filtered because absolute difference between the median fragment length of alt reads and median fragment length of ref reads at this locus exceeds threshold
low_depth—Site filtered because the read depth is too low
low_frac_info_reads—Site filtered because the fraction of informative reads is below threshold
low_normal_depth—Site filtered because the normal sample read depth is too low
long_indel—Site filtered because the indel length is too long
mapping_quality—Site filtered because median mapping quality of alt reads at this locus does not meet threshold
multiallelic—Site filtered because more than two alt alleles pass tumor LOD
non_homref_normal—Site filtered because the normal sample genotype is not homozygous reference
no_reliable_supporting_read—Site filtered because no reliable supporting somatic read exists
panel_of_normals—Seen in at least one sample in the panel of normals VCF
read_position—Site filtered because median of distances between start/end of read and this locus is below threshold
RMxNRepeatRegion—Site filtered because all or part of the variant allele is a repeat of the reference
strand_artifact—Site filtered because of severe strand bias
str_contraction—Site filtered due to suspected PCR error where the alt allele is one repeat unit less than the reference
too_few_supporting_reads—Site filtered because there are too few supporting reads in the tumor sample
weak_evidence—Somatic variant score does not meet threshold

Somatic workflow possible FILTER entries include:

base_quality—Site filtered because median base quality of alt reads at this locus does not meet threshold
filtered_reads—Site filtered because too large a fraction of reads have been filtered out
fragment_length—Site filtered because absolute difference between the median fragment length of alt reads and median fragment length of ref reads at this locus exceeds threshold
low_depth—Site filtered because the read depth is too low
low_frac_info_reads—Site filtered because the fraction of informative reads is below threshold
low_normal_depth—Site filtered because the normal sample read depth is too low
long_indel—Site filtered because the indel length is too long
mapping_quality—Site filtered because median mapping quality of alt reads at this locus does not meet threshold
multiallelic—Site filtered because more than two alt alleles pass tumor LOD
non_homref_normal—Site filtered because the normal sample genotype is not homozygous reference
no_reliable_supporting_read—Site filtered because no reliable supporting somatic read exists
panel_of_normals—Seen in at least one sample in the panel of normals VCF
read_position—Site filtered because median of distances between start/end of read and this locus is below threshold
RMxNRepeatRegion—Site filtered because all or part of the variant allele is a repeat of the reference
strand_artifact—Site filtered because of severe strand bias
str_contraction—Site filtered due to suspected PCR error where the alt allele is one repeat unit less than the reference
too_few_supporting_reads—Site filtered because there are too few supporting reads in the tumor sample
weak_evidence—Somatic variant score does not meet threshold
systematic_noise—Site filtered based on evidence of systematic noise in normals
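A FILTER value can be interpreted with a small helper. This is a sketch under assumptions taken from the VCF specification, not from the descriptions above: multiple failed filters are separated by semicolons, and a missing value marker ('.') means that no filters were applied. The function name `parse_filter` is hypothetical.

```python
def parse_filter(value):
    """Return (passed, failed_filters) for a FILTER column value.

    passed is True for PASS, False if any filter failed, and None
    when the column holds the missing value marker.
    """
    if value == "PASS":
        return True, []
    if value == ".":
        return None, []
    return False, value.split(";")
```

For example, `parse_filter("LowDepth;LowGQ")` returns `(False, ["LowDepth", "LowGQ"])`, which makes it straightforward to count how often each filter is triggered.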

INFO

Germline workflow possible INFO entries include:

AC—Allele count in genotypes for each ALT allele, in the same order as listed.
AF—Allele Frequency for each ALT allele, in the same order as listed.
AN—The total number of alleles in called genotypes.
DB—dbSNP Membership.
FS—Phred-scaled p-value using Fisher's exact test to detect strand bias.
QD—Variant Confidence/Quality by Depth.
R2_5P_bias—Score based on mate bias and distance from 5 prime end.
SOR—Symmetric Odds Ratio of 2x2 contingency table to detect strand bias.
DP—Approximate read depth (informative and non-informative); some reads may have been filtered based on mapq etc.
END—Stop position of the interval.
FractionInformativeReads—The fraction of informative reads out of the total reads.
MQ—RMS Mapping Quality.
MQRankSum—Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities.
ReadPosRankSum—Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias.
SOMATIC—At least one variant at this position is somatic.

Somatic workflow possible INFO entries include:

DP—Approximate read depth (informative and non-informative); some reads may have been filtered based on mapq etc.
END—Stop position of the interval.
FractionInformativeReads—The fraction of informative reads out of the total reads.
MQ—RMS Mapping Quality.
MQRankSum—Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities.
ReadPosRankSum—Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias.
AQ—Systematic noise score.
hotspot—Known somatic site, used to increase confidence in call.
SOMATIC—At least one variant at this position is somatic.
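The INFO column can be unpacked into a dictionary. The sketch below assumes the VCF specification's encoding, which is not spelled out above: entries are separated by semicolons, keyed entries use `key=value`, and flags such as DB or SOMATIC appear as a bare key. The function name `parse_info` is hypothetical, and values are left as strings rather than converted to numbers.

```python
def parse_info(field):
    """Parse an INFO column value into a dict of strings and flags."""
    info = {}
    if field == ".":
        # Missing value marker: no INFO entries.
        return info
    for item in field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        else:
            # Flag entry: its presence is the information.
            info[item] = True
    return info
```

For example, `parse_info("DP=30;MQ=60.00;SOMATIC")` yields `{"DP": "30", "MQ": "60.00", "SOMATIC": True}`.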

FORMAT

The format column lists fields separated by colons. For example, GT:GQ.

Germline workflow available fields include:

AD—Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed.
AF—Allele fractions for alt alleles in the order listed.
DP—Approximate read depth (reads with MQ=255 or with bad mates are filtered).
F1R2—Count of reads in F1R2 pair orientation supporting each allele.
F2R1—Count of reads in F2R1 pair orientation supporting each allele.
GP—Phred-scaled posterior probabilities for genotypes as defined in the VCF specification.
GQ—Genotype quality.
GT—Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.
MB—Per-sample component statistics to detect mate bias.
PL—Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification.
PRI—Phred-scaled prior probabilities for genotypes.
PS—Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group.
SB—Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.
SQ—Somatic quality.

Somatic workflow available fields include:

AD—Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed.
AF—Allele fractions for alt alleles in the order listed.
DP—Approximate read depth (reads with MQ=255 or with bad mates are filtered).
F1R2—Count of reads in F1R2 pair orientation supporting each allele.
F2R1—Count of reads in F2R1 pair orientation supporting each allele.
GT—Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.
MB—Per-sample component statistics to detect mate bias.
PS—Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group.
SB—Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.
SQ—Somatic quality.
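The GT encoding described in the lists above can be decoded into alleles as follows. This is an illustrative sketch with a hypothetical function name, `decode_genotype`. The use of a vertical bar (|) for phased genotypes and of '.' for a missing allele comes from the VCF specification, not from the descriptions above.

```python
def decode_genotype(gt, ref, alts):
    """Translate a GT value such as 0/1 into allele sequences.

    Index 0 is the REF allele, index 1 the first ALT allele, and so on.
    Returns (alleles, phased).
    """
    alleles = [ref] + list(alts)
    phased = "|" in gt
    indexes = gt.replace("|", "/").split("/")
    called = [None if i == "." else alleles[int(i)] for i in indexes]
    return called, phased
```

For example, `decode_genotype("0/1", "A", ["T"])` returns `(["A", "T"], False)`, a heterozygous call with no phasing information.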

SAMPLE

The sample column gives the values specified in the FORMAT column.
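Because the FORMAT column names the fields and the SAMPLE column supplies their values in the same order, the two can be paired directly. The sketch below uses a hypothetical function name, `parse_sample`, and keeps all values as strings.

```python
def parse_sample(format_field, sample_field):
    """Pair colon-separated FORMAT keys with the matching SAMPLE values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so omitted trailing values are skipped.
    return dict(zip(keys, values))
```

For example, with a FORMAT value of GT:GQ and a SAMPLE value of 0/1:48, `parse_sample("GT:GQ", "0/1:48")` returns `{"GT": "0/1", "GQ": "48"}`.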