Input Files

msVCF input list for the Phase Common step (step 1)

For the Phase Common step (step 1), it is recommended to provide msVCF files generated with the DRAGEN gVCF Genotyper tool. This step takes as input a .txt file containing either the path to a single msVCF or a list of msVCF paths, one path per line. The msVCF files must comply with the following requirements:

• Per-chromosome msVCF OR positionally sorted msVCF shards spanning a whole chromosome without overlap. See below for the shard definition

• Generated from the same reference build

• Compressed and indexed

• With unphased GT calls

• With no duplicates

• All contigs of the studied genome must be listed in the header

Note: for mixed ploidy chromosomes, each PAR and non-PAR region of the chromosome must be treated as its own chromosome. For example, on human data, the input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.
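Some of the requirements above can be checked before launching the step. The following is a minimal sketch, not part of DRAGEN: the function name check_msvcf_list is hypothetical, and it assumes that "compressed and indexed" means a .gz file with a .tbi or .csi index stored next to it. It does not inspect the VCF content (GT phasing, duplicates, header contigs).

```python
from pathlib import Path


def check_msvcf_list(list_path):
    """Return a list of problems found in an msVCF list file (one path per line)."""
    problems = []
    lines = Path(list_path).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    if not paths:
        problems.append("list file is empty")
    for p in paths:
        if not Path(p).is_file():
            problems.append(f"{p}: file not found")
            continue
        # Assumption: compressed msVCF files carry a .gz suffix.
        if not p.endswith(".gz"):
            problems.append(f"{p}: not compressed (expected a .gz suffix)")
        # Assumption: the index sits next to the file as <file>.tbi or <file>.csi.
        if not any(Path(p + ext).is_file() for ext in (".tbi", ".csi")):
            problems.append(f"{p}: no .tbi or .csi index found")
    return problems
```

An empty return value means that every listed path exists, looks compressed, and has an index beside it.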

Genetic map (steps 1 and 3)

A per-chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map, computed from the recombination rate of the species and its reference genome, or use the genetic map for the human hg38 reference genome, available in the Imputation files on the Illumina DRAGEN Bio-IT Platform Support Site page. DRAGEN does not generate custom genetic map files.

The genetic map should follow the format:

• 3 columns: position, chromosome number, distance (cM), in this order and tab separated

• The genetic map for a mixed ploidy chromosome must be separated into one map per PAR and non-PAR region (e.g. for human, chromosome X is split into the PAR1 chrX_par1, PAR2 chrX_par2, and non-PAR chrX_nonpar regions)

• No genetic map is needed for a region in which all samples are haploid (e.g. for human, chromosome Y, chrY)

The user must ensure that the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.
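The column layout described above can be verified with a short script. This is a minimal sketch under stated assumptions: the function name read_genetic_map is hypothetical, maps may be plain or gzip-compressed text, and a first line whose position field is not an integer is treated as a header row and skipped (the source does not say whether a header is present).

```python
import gzip


def read_genetic_map(path):
    """Parse a 3-column, tab-separated genetic map: position, chromosome, cM."""
    opener = gzip.open if str(path).endswith(".gz") else open
    rows = []
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError(
                    f"line {line_no}: expected 3 tab-separated columns, "
                    f"got {len(fields)}"
                )
            pos, chrom, cm = fields
            # Assumption: a non-numeric first line is a header row.
            if line_no == 1 and not pos.isdigit():
                continue
            rows.append((int(pos), chrom, float(cm)))
    return rows
```

The function raises ValueError on any line that does not have exactly three tab-separated fields, or whose position or distance is not numeric.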

Config file (steps 1 and 3)

This configuration file is a required text file. It allows proper handling of haploid/diploid chromosomes and verification of concordance between the genetic maps, the msVCF input, and the sample type file information. The current configuration supports binary gender (male or female) and ploidy 2 or 1. When a region has different ploidies in male and female samples, it is considered a mixed ploidy region (e.g. for human, the non-PAR region on chromosome X, chrX_nonpar).

You can use your own file, or use the file included in the genetic_maps-hg38-2.0 folder available at DRAGEN Bio-IT Platform support pages on the Illumina website.

Example of a Config file

##version=1.0

##ref_build=hg38

#filename region male_ploidy female_ploidy

chr1.gmap.gz chr1:1-248956422 2 2

chr2.gmap.gz chr2:1-242193529 2 2

chr3.gmap.gz chr3:1-198295559 2 2

chr4.gmap.gz chr4:1-190214555 2 2

chr5.gmap.gz chr5:1-181538259 2 2

chr6.gmap.gz chr6:1-170805979 2 2

chr7.gmap.gz chr7:1-159345973 2 2

chr8.gmap.gz chr8:1-145138636 2 2

chr9.gmap.gz chr9:1-138394717 2 2

chr10.gmap.gz chr10:1-133797422 2 2

chr11.gmap.gz chr11:1-135086622 2 2

chr12.gmap.gz chr12:1-133275309 2 2

chr13.gmap.gz chr13:1-114364328 2 2

chr14.gmap.gz chr14:1-107043718 2 2

chr15.gmap.gz chr15:1-101991189 2 2

chr16.gmap.gz chr16:1-90338345 2 2

chr17.gmap.gz chr17:1-83257441 2 2

chr18.gmap.gz chr18:1-80373285 2 2

chr19.gmap.gz chr19:1-58617616 2 2

chr20.gmap.gz chr20:1-64444167 2 2

chr21.gmap.gz chr21:1-46709983 2 2

chr22.gmap.gz chr22:1-50818468 2 2

chrX_par1.gmap.gz chrX:1-2781479 2 2

chrX_nonpar.gmap.gz chrX:2781480-155701382 1 2

chrX_par2.gmap.gz chrX:155701383-156040895 2 2

Instructions to make a custom configuration file:

The config file is a text file with the headers:

• ##version

• ##ref_build, indicating the reference build used for the study.

Below the headers, the config file contains 4 tab-delimited columns. Every column must be populated on every line.

Column information	Description
First column: filename	Specifies the genetic map basename, 1 name per line. Mixed ploidy chromosomes must be separated into par and non-par regions. Basenames must match genetic map basenames.
Second column: region	Specifies the start and end positions of the chromosome or sub-chromosome region with format <contig_name>:<start_position>-<end_position>. For chromosomes without mixed ploidy regions, the start position is 1, end position is the length of the chromosome (1-based, inclusive). For chromosomes with mixed ploidy regions, for each region, the start and end positions are those of the region (1-based, inclusive).
Third column: mixed ploidy subject	Specifies 2 for diploid chromosomes and PAR regions, and 1 for non-PAR regions.
Fourth column: diploid subject	Specifies 2 for all chromosomes

For a mixed ploidy chromosome, ensure the genetic map is separated into one file per PAR and non-PAR region, with no overlap between regions. Example: for human data, the prefixes should be chrX_par1, chrX_nonpar, and chrX_par2.

Sample type file (steps 1 and 3)

The sample type file is a required file. The number of samples and name of samples in the input multisample VCF and sample type file should match.

The sample type file is a txt file with the following format:

• 2 columns, tab or space delimited

• First column: list of all sample names present in the input msVCF

• Second column: 1 or 2. 1 for a subject with mixed ploidy chromosomes, 2 for a subject with all diploid chromosomes.
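The format above, and the requirement that sample names match the input msVCF, can be checked as follows. This is a minimal sketch: parse_sample_types and compare_samples are hypothetical names, and the list of msVCF sample names is assumed to have been extracted separately (for example from the #CHROM header line).

```python
def parse_sample_types(text):
    """Map each sample name to its type code (1 or 2)."""
    types = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # accepts tabs or spaces
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns")
        name, code = fields
        if code not in ("1", "2"):
            raise ValueError(f"line {line_no}: type must be 1 or 2")
        if name in types:
            raise ValueError(f"line {line_no}: duplicate sample {name!r}")
        types[name] = int(code)
    return types


def compare_samples(types, vcf_samples):
    """Return (missing_from_type_file, missing_from_vcf), both sorted."""
    vcf_set = set(vcf_samples)
    type_set = set(types)
    return sorted(vcf_set - type_set), sorted(type_set - vcf_set)
```

Both returned lists are empty when the sample type file and the msVCF name exactly the same samples.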