Normalizing Small Tandem Duplications

The SV caller can also represent tandem duplications as insertions. Having two possible representations for the same event makes the VCF output ambiguous, especially for small tandem duplications, and can lead to complications such as the same call being reported twice without being recognized as a duplicate.

To normalize the SV caller output so that the same variant type is not represented in two different VCF formats, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications are formatted similarly to incomplete insertions and use the symbolic allele <INS> in the ALT field. The following example shows an insertion that was converted from a tandem duplication during this normalization process:

chr2 2520057 MantaDUP:TANDEM:53645:0:1:0:0:0 T <INS> 813 PASS END=2520057;SVTYPE=INS;SVLEN=52;DUPSVLEN=52 GT:FT:GQ:PL:PR:SR 0/1:PASS:393:863,0,390:25,0:19,25
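As a rough illustration, a downstream script could flag converted insertions as sketched below. The function names parse_info and is_converted_insertion are hypothetical, and the detection rule (ALT of <INS>, SVTYPE=INS, and a DUPSVLEN key in INFO) is an assumption drawn from the example record above, not a documented contract of the SV caller.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_converted_insertion(vcf_line: str) -> bool:
    """Heuristic (assumption): <INS> record with SVTYPE=INS and a DUPSVLEN key."""
    cols = vcf_line.split()  # tolerates tab- or space-separated columns
    if len(cols) < 8:
        return False
    alt, info = cols[4], parse_info(cols[7])
    return alt == "<INS>" and info.get("SVTYPE") == "INS" and "DUPSVLEN" in info


# Example record copied from the text above.
record = (
    "chr2 2520057 MantaDUP:TANDEM:53645:0:1:0:0:0 T <INS> 813 PASS "
    "END=2520057;SVTYPE=INS;SVLEN=52;DUPSVLEN=52 "
    "GT:FT:GQ:PL:PR:SR 0/1:PASS:393:863,0,390:25,0:19,25"
)
print(is_converted_insertion(record))
```

Keying on DUPSVLEN rather than on the record ID is a design choice for this sketch only; whether the ID prefix is stable across versions is not stated here.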

Converted insertions include copies of certain output fields, with the values they would have in the tandem duplication record. For example, INFO/DUPSVINSSEQ provides a copy of the breakpoint insertion value computed for the duplication, which a duplication record would normally report in INFO/SVINSSEQ. The following example shows a converted insertion with a breakpoint insertion value:

chr2 2645730 MantaDUP:TANDEM:53649:0:1:0:0:0 C <INS> 367 PASS END=2645730;SVTYPE=INS;SVLEN=97;DUPSVLEN=86;DUPSVINSLEN=11;DUPSVINSSEQ=CTCACCTTCAT GT:FT:GQ:PL:PR:SR 0/1:PASS:367:417,0,386:19,0:20,15

For more information about copied INFO fields, see VCF INFO Fields. All copied INFO fields use the same DUP prefix.
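The copied fields can be collected by their shared prefix, as in the minimal sketch below, which uses the INFO column from the second example record above. The helper name dup_fields is hypothetical. Note that in this record DUPSVLEN plus DUPSVINSLEN equals SVLEN (86 + 11 = 97); this is an observation from the example, not a rule stated in this section, so treat it as an assumption.

```python
def dup_fields(info: str) -> dict:
    """Return the key=value INFO entries whose key starts with 'DUP'."""
    copied = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep and key.startswith("DUP"):
            copied[key] = value
    return copied


# INFO column copied from the second example record.
info = (
    "END=2645730;SVTYPE=INS;SVLEN=97;"
    "DUPSVLEN=86;DUPSVINSLEN=11;DUPSVINSSEQ=CTCACCTTCAT"
)

copied = dup_fields(info)

# Observation from this example only (assumption, not a documented rule):
# duplication length plus breakpoint insertion length matches SVLEN.
total = int(copied["DUPSVLEN"]) + int(copied.get("DUPSVINSLEN", 0))

for key, value in copied.items():
    print(key, value)
print("DUPSVLEN + DUPSVINSLEN =", total)
```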