The custom multigenome hash table builder tool generates a multigenome hash table from a set of population variants provided by the user. The variants must be specified in VCF format, with one VCF file per individual in the population (multisample VCF files are not permitted).
The tool supports only VCF input that meets the following requirements:
• Single-sample VCF only; gVCF is not supported.
• SNPs and indels (max 200 bases) only.
• Phased variants. We recommend using external tools such as WhatsHap to generate phased input VCF files.
• Diploid genotype calls only (even for sex chromosomes).
• MNVs converted to consecutive SNP records.
• Variants positionally sorted, with contigs in the same order as the main FASTA reference genome provided in --ht-reference.
• Only the following FILTER codes. Non-PASS records are ignored:
  – ##FILTER=<ID=PASS,Description="All filters passed">
  – ##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
  – ##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
• Only the following FORMAT fields. Other FORMAT fields need to be removed:
  – ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
  – ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
  – ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
  – ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
  – ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
  – ##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
  – ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
  – ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
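As an illustration of the MNV requirement, the following Python sketch splits one MNV into single-base SNP records. The function name `split_mnv` is hypothetical and is not part of the builder tool. The sketch assumes that positions where the REF and ALT bases match produce no record, and that the genotype and phase set of the original record are copied unchanged to each resulting SNP record.

```python
def split_mnv(chrom, pos, ref, alt):
    """Split an MNV (REF and ALT of equal length > 1) into SNP records.

    Returns a list of (chrom, pos, ref_base, alt_base) tuples, one per
    position where REF and ALT differ. SNPs and indels are returned as-is.
    """
    if len(ref) != len(alt) or len(ref) == 1:
        return [(chrom, pos, ref, alt)]
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # assumption: matching bases produce no record
    ]
```

For example, an MNV at position 100 with REF `AC` and ALT `GT` becomes `A>G` at position 100 and `C>T` at position 101.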
The INFO field is not used and must be set to “.” (a single period).
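The record-level requirements above can be checked before running the builder. The following is a minimal Python sketch, not the tool's own validation logic. The function name `check_record` is hypothetical, and two interpretations are assumptions: that "max 200 bases" applies to the longer of the REF and ALT alleles, and that allele and genotype checks matter only for PASS records because non-PASS records are ignored. Sort order and header lines are not checked.

```python
import re

# Values taken from the requirements listed above.
ALLOWED_FILTERS = {"PASS", "RefCall", "LowQual"}
ALLOWED_FORMAT_KEYS = {"GT", "GQ", "DP", "MIN_DP", "AD", "VAF", "PL", "PS"}
MAX_ALLELE_LEN = 200  # assumption: limit applies to the longer allele

PLAIN_ALLELE = re.compile(r"^[ACGTN]+$", re.IGNORECASE)
DIPLOID_GT = re.compile(r"^(\d+|\.)[/|](\d+|\.)$")


def check_record(line):
    """Return a list of problems found in one VCF data line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        return ["expected 10 tab-separated columns (single-sample VCF)"]
    _chrom, _pos, _id, ref, alt, _qual, flt, info, fmt, sample = fields

    problems = []
    if not set(flt.split(";")) <= ALLOWED_FILTERS:
        problems.append(f"unexpected FILTER value: {flt}")
    if info != ".":
        problems.append("INFO must be '.'")

    keys = fmt.split(":")
    extra = sorted(set(keys) - ALLOWED_FORMAT_KEYS)
    if extra:
        problems.append("unsupported FORMAT keys: " + ",".join(extra))

    if flt != "PASS":
        return problems  # non-PASS records are ignored, so skip allele checks

    for allele in alt.split(","):
        if not (PLAIN_ALLELE.match(ref) and PLAIN_ALLELE.match(allele)):
            problems.append(f"not a SNP or indel: {ref}>{allele}")
        elif len(ref) == len(allele) and len(ref) > 1:
            problems.append(f"MNV must be split into SNPs: {ref}>{allele}")
        elif max(len(ref), len(allele)) > MAX_ALLELE_LEN:
            problems.append("indel longer than 200 bases")

    gt = sample.split(":")[0] if keys[0] == "GT" else ""
    if not DIPLOID_GT.match(gt):
        problems.append(f"genotype is not diploid: {gt or 'missing'}")
    return problems
```

Calling `check_record` on each non-header line and printing any non-empty result gives a quick pre-flight report for one input file.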
To build a high-performance custom genome, we highly recommend using VCF files generated from long-read sequencing, which maximizes the number of phased genotype calls.
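One way to gauge how well an input file is phased is to count the share of phased genotype calls. The Python sketch below is illustrative only. The function name `phased_fraction` is hypothetical, and it assumes that only heterozygous PASS calls are relevant to the count and that GT is the first FORMAT key, as the VCF specification requires when GT is present.

```python
def phased_fraction(vcf_lines):
    """Fraction of heterozygous PASS genotype calls that are phased.

    Returns None when the input contains no heterozygous PASS calls.
    """
    phased = total = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[6] != "PASS":
            continue  # non-PASS records are ignored
        gt = fields[9].split(":")[0]
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or alleles[0] == alleles[1]:
            continue  # skip non-diploid and homozygous calls
        total += 1
        if "|" in gt:
            phased += 1
    return phased / total if total else None
```

The function accepts any iterable of lines, so an open file handle for an uncompressed VCF can be passed directly.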