Input Files


For the Phase Common step (step 1), the recommended input is an msVCF generated with the DRAGEN gVCF Genotyper tool. This step takes as input a .txt file containing the path to a single msVCF or a list of msVCF paths, one path per line. Each msVCF must comply with the following requirements:
• | one msVCF per chromosome, OR positionally sorted msVCF shards that span a whole chromosome without overlap (shards are defined below) |
• | generated from the same reference build |
• | compressed and indexed |
• | with unphased GT calls |
• | with no duplicates |
• | all contigs of the studied genome must be listed in the header |
Note: for mixed ploidy chromosomes, each PAR and non-PAR region of the chromosome must be treated as its own chromosome. For example, on human data, the sample input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.
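The list file and a few of the requirements above can be checked before launching step 1. The following is a minimal sketch, not part of DRAGEN: the function name `check_msvcf_list` is hypothetical, and it assumes that compressed msVCF files end in `.vcf.gz` and that the index sits next to each file with a `.tbi` or `.csi` suffix.

```python
from pathlib import Path


def check_msvcf_list(list_path):
    """Return a list of problems found in an msVCF path list (one path per line)."""
    problems = []
    lines = Path(list_path).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    if not paths:
        problems.append("list file is empty")
    seen = set()
    for p in paths:
        if p in seen:
            problems.append(f"{p}: listed more than once")
        seen.add(p)
        # Assumption: compressed msVCF files use the .vcf.gz extension.
        if not p.endswith(".vcf.gz"):
            problems.append(f"{p}: expected a compressed .vcf.gz file")
        if not Path(p).is_file():
            problems.append(f"{p}: file not found")
        # Assumption: the index is stored beside the file as .tbi or .csi.
        if not any(Path(p + ext).is_file() for ext in (".tbi", ".csi")):
            problems.append(f"{p}: no .tbi or .csi index found")
    return problems
```

An empty return value means every listed path exists and has an index. The sketch does not open the files, so it cannot confirm the reference build, unphased GT calls, or the contig header.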

The msVCF input list provided at step 1 is pre-processed to generate a formatted msVCF called <prefix>.preprocess.vcf.gz. This formatted msVCF is written to the output directory and must be used as the input to the Phase Rare step (step 3).
To facilitate parallel processing on distributed compute nodes, and to avoid the overhead of downloading and uploading a chromosome-level multisample VCF for each sub-chromosome job, chromosome portions of equal size (shards) can be used as input. The gVCF Genotyper tool can generate these equal-size shards when run with the appropriate option.
Note: streaming from the cloud is not supported. Instead, download the input files beforehand and process them from local storage to achieve maximum I/O efficiency and stability.
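The requirement that shards span a whole chromosome without overlap can be illustrated with a small check. This is a sketch under stated assumptions: the source does not say how shard boundaries are described, so the code assumes they are known as explicit 1-based, inclusive (start, end) coordinates listed in positional order. The function name `check_shard_coverage` is hypothetical.

```python
def check_shard_coverage(shards, chrom_length):
    """Check that (start, end) shards, 1-based inclusive and in list order,
    tile positions 1..chrom_length with no gap and no overlap."""
    problems = []
    next_start = 1  # first position not yet covered by a shard
    for index, (start, end) in enumerate(shards, start=1):
        if end < start:
            problems.append(f"shard {index}: end {end} is before start {start}")
        if start < next_start:
            problems.append(
                f"shard {index}: overlaps or is out of order "
                f"(starts at {start}, expected {next_start})"
            )
        elif start > next_start:
            problems.append(
                f"shard {index}: gap before start {start} (expected {next_start})"
            )
        next_start = max(next_start, end + 1)
    if next_start != chrom_length + 1:
        problems.append(
            f"shards end at {next_start - 1}, chromosome length is {chrom_length}"
        )
    return problems
```

For example, two shards `(1, 25000000)` and `(25000001, 50818468)` pass the check for a chromosome of length 50818468, while a second shard starting at 25000000 is reported as an overlap.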

A per-chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map, computed from the recombination rate of the species and its reference genome, or use the genetic map for the human hg38 reference genome, available in the Imputation files on the Illumina DRAGEN Bio-IT Platform Support Site page. DRAGEN does not generate custom genetic map files.
The genetic map should follow the format:
• | 3 columns: position, chromosome number, distance (cM), in this order and tab separated |
• | The genetic map for a mixed ploidy chromosome must be separated into its PAR and non-PAR regions (e.g. for human, chromosome X is split into PAR1 chrX_par1, PAR2 chrX_par2, and non-PAR chrX_nonpar regions) |
• | A genetic map is not needed for a region in which all samples are haploid (e.g. for human, chromosome Y, chrY) |
The user must ensure that the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.
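The column layout above can be verified with a short script. This is a minimal sketch, not a DRAGEN tool. The function name `check_genetic_map` is hypothetical, and the sketch makes three assumptions that the text does not state: the file may be gzip-compressed, a non-numeric first line is a header to skip, and positions and cM distances do not decrease from one line to the next.

```python
import gzip


def check_genetic_map(path):
    """Return a list of format problems in a 3-column genetic map file."""
    problems = []
    chroms = set()
    prev_pos = prev_cm = None
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        for n, line in enumerate(fh, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            fields = line.split("\t")
            if len(fields) != 3:
                problems.append(
                    f"line {n}: expected 3 tab-separated columns, found {len(fields)}"
                )
                continue
            pos_text, chrom, cm_text = fields  # position, chromosome, distance (cM)
            try:
                pos, cm = int(pos_text), float(cm_text)
            except ValueError:
                if n == 1:
                    continue  # assumption: a non-numeric first line is a header
                problems.append(f"line {n}: position or cM value is not numeric")
                continue
            chroms.add(chrom)
            # Assumption: positions and cumulative cM never decrease.
            if prev_pos is not None and pos < prev_pos:
                problems.append(f"line {n}: position {pos} is below previous {prev_pos}")
            if prev_cm is not None and cm < prev_cm:
                problems.append(f"line {n}: cM {cm} is below previous {prev_cm}")
            prev_pos, prev_cm = pos, cm
    if len(chroms) > 1:
        problems.append(f"more than one chromosome in file: {sorted(chroms)}")
    return problems
```

The check covers layout only. It cannot tell whether the map was built on the same reference build as the msVCF input.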

The configuration file is a required text file. It allows for proper handling of haploid/diploid chromosomes and verification of concordance between the genetic maps, the msVCF input, and the sample type file information. The current configuration supports binary gender (male or female) and a ploidy of 2 or 1. When a region has different ploidies in male and female samples, the region is considered a mixed ploidy region (e.g. for human, the non-PAR region on chromosome X, chrX_nonpar).
You can use your own file, or use the file included in the genetic_maps-hg38-2.0 folder available on the DRAGEN Bio-IT Platform support pages on the Illumina website.

##version=1.0
##ref_build=hg38
#filename region male_ploidy female_ploidy
chr1.gmap.gz chr1:1-248956422 2 2
chr2.gmap.gz chr2:1-242193529 2 2
chr3.gmap.gz chr3:1-198295559 2 2
chr4.gmap.gz chr4:1-190214555 2 2
chr5.gmap.gz chr5:1-181538259 2 2
chr6.gmap.gz chr6:1-170805979 2 2
chr7.gmap.gz chr7:1-159345973 2 2
chr8.gmap.gz chr8:1-145138636 2 2
chr9.gmap.gz chr9:1-138394717 2 2
chr10.gmap.gz chr10:1-133797422 2 2
chr11.gmap.gz chr11:1-135086622 2 2
chr12.gmap.gz chr12:1-133275309 2 2
chr13.gmap.gz chr13:1-114364328 2 2
chr14.gmap.gz chr14:1-107043718 2 2
chr15.gmap.gz chr15:1-101991189 2 2
chr16.gmap.gz chr16:1-90338345 2 2
chr17.gmap.gz chr17:1-83257441 2 2
chr18.gmap.gz chr18:1-80373285 2 2
chr19.gmap.gz chr19:1-58617616 2 2
chr20.gmap.gz chr20:1-64444167 2 2
chr21.gmap.gz chr21:1-46709983 2 2
chr22.gmap.gz chr22:1-50818468 2 2
chrX_par1.gmap.gz chrX:1-2781479 2 2
chrX_nonpar.gmap.gz chrX:2781480-155701382 1 2
chrX_par2.gmap.gz chrX:155701383-156040895 2 2

The config file starts with two header lines:
• | ##version |
• | ##ref_build indicating the reference build used for the study. |
The body of the config file contains 4 tab-delimited columns, all of which must be populated.
Column information | Description |
---|---|
First column: filename | Specifies the genetic map basename, one name per line. Mixed ploidy chromosomes must be separated into PAR and non-PAR regions. Basenames must match genetic map basenames. |
Second column: region | Specifies the start and end positions of the chromosome or sub-chromosome region with format <contig_name>:<start_position>-<end_position>. For chromosomes without mixed ploidy regions, the start position is 1 and the end position is the length of the chromosome (1-based, inclusive). For chromosomes with mixed ploidy regions, the start and end positions of each region are those of the region (1-based, inclusive). |
Third column: male_ploidy (mixed ploidy subject) | Specifies 2 for diploid chromosomes and PAR regions, and 1 for non-PAR regions. |
Fourth column: female_ploidy (diploid subject) | Specifies 2 for all chromosomes. |
For mixed ploidy chromosomes, ensure the genetic map is separated into its PAR and non-PAR regions with no overlap. Example: for human data, the prefixes should be chrX_par1, chrX_nonpar, and chrX_par2.
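The config layout described above can be parsed and validated with a few lines of code. The following is a minimal sketch, not DRAGEN's own parser. The names `ConfigRow` and `parse_config` are hypothetical, and the field names follow the `#filename region male_ploidy female_ploidy` header shown in the example file.

```python
import re
from typing import NamedTuple


class ConfigRow(NamedTuple):
    filename: str
    contig: str
    start: int
    end: int
    male_ploidy: int
    female_ploidy: int


# <contig_name>:<start_position>-<end_position>
REGION_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_config(text):
    """Parse config text into (metadata dict, list of ConfigRow)."""
    meta, rows = {}, []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        if line.startswith("##"):  # header lines such as ##version=1.0
            key, _, value = line[2:].partition("=")
            meta[key] = value
            continue
        if line.startswith("#"):  # column-name line
            continue
        fields = line.split("\t")
        if len(fields) != 4 or not all(fields):
            raise ValueError(f"line {n}: expected 4 populated tab-delimited columns")
        filename, region, male, female = fields
        match = REGION_RE.match(region)
        if match is None:
            raise ValueError(f"line {n}: bad region {region!r}")
        if male not in ("1", "2") or female not in ("1", "2"):
            raise ValueError(f"line {n}: ploidy must be 1 or 2")
        rows.append(ConfigRow(filename, match.group(1), int(match.group(2)),
                              int(match.group(3)), int(male), int(female)))
    for required in ("version", "ref_build"):
        if required not in meta:
            raise ValueError(f"missing ##{required} header")
    return meta, rows
```

With the parsed rows, mixed ploidy regions are the rows where `male_ploidy` differs from `female_ploidy`, which for the hg38 example is only the chrX_nonpar entry.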

The sample type file is a required file. The number and names of the samples in the input multisample VCF and in the sample type file must match.
The sample type file is a txt file with the following format:
• | 2 columns, tab or space delimited |
• | First column: list of all sample names present in the input msVCF |
• | Second column: 1 or 2. 1 for a subject with mixed ploidy chromosomes, 2 for a subject with all diploid chromosomes. |
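The match between the sample type file and the msVCF can be checked by comparing the file against the sample names in the VCF `#CHROM` header line. This is a minimal sketch with hypothetical function names. It assumes the msVCF is gzip or bgzip compressed and follows the standard VCF layout, in which sample names start at the tenth tab-separated column of the `#CHROM` line.

```python
import gzip


def read_vcf_samples(vcf_path):
    """Return the sample names from the #CHROM header line of a compressed VCF."""
    with gzip.open(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed VCF fields; samples start at column 10.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break
    raise ValueError("no #CHROM header line found")


def read_sample_types(path):
    """Return {sample_name: 1 or 2} from a 2-column sample type file."""
    types = {}
    with open(path) as fh:
        for n, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            fields = line.split()  # accepts tabs or spaces
            if len(fields) != 2 or fields[1] not in ("1", "2"):
                raise ValueError(f"line {n}: expected '<sample> <1|2>'")
            name, code = fields
            if name in types:
                raise ValueError(f"line {n}: duplicate sample {name!r}")
            types[name] = int(code)
    return types


def compare_samples(vcf_samples, sample_types):
    """Return (names missing from the type file, names missing from the VCF)."""
    vcf_set, type_set = set(vcf_samples), set(sample_types)
    return sorted(vcf_set - type_set), sorted(type_set - vcf_set)
```

Both returned lists are empty when the two files agree on sample names, which also means the sample counts match.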