Iterative gVCF Genotyper Analysis

Iterative gVCF Genotyper Analysis offers an iterative workflow to aggregate new samples into an existing cohort. The workflow lets users incrementally aggregate new batches of samples with existing batches, without redoing the analysis from scratch across all samples each time new samples become available. The workflow takes single-sample gVCF files as input and can be run in step-by-step mode, if multiple batches of samples are available, or in end-to-end mode, if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller are also accepted as input.

Workflow steps

gVCF aggregation

Users can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format that stores gVCF data for multiple samples, similar to a multisample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. When a large number of samples are available, users can divide the samples into multiple batches of similar size (e.g., 1000 samples each) and repeat this step (Step 1) for every batch. If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller are ignored.
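The batching described above can be sketched as follows. This is a minimal illustration, not part of the tool: the function names and the even-split strategy are assumptions, and the only grounded detail is that each batch is described by a list file with one gVCF path per line.

```python
from typing import List


def split_into_batches(gvcf_paths: List[str], target_size: int = 1000) -> List[List[str]]:
    """Split gVCF paths into batches of similar size, none larger than target_size."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    total = len(gvcf_paths)
    if total == 0:
        return []
    n_batches = -(-total // target_size)  # ceiling division
    base, extra = divmod(total, n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches take one additional sample each.
        size = base + (1 if i < extra else 0)
        batches.append(gvcf_paths[start:start + size])
        start += size
    return batches


def variant_list_text(batch: List[str]) -> str:
    """Render one batch as list-file content: one gVCF path per line."""
    return "\n".join(batch) + "\n"
```

Each batch can then be written to its own list file and processed independently in Step 1.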

Census aggregation

After all per-batch census files are generated, users can aggregate them into a single global census file (Step 2). This step scales to thousands of batches and is much more efficient than aggregating gVCFs from all batches. When a new batch of samples becomes available, users only need to perform Step 1 on the new batch and then aggregate the census file from that batch with the global census file from all previous batches to generate an updated global census file.
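The incremental update can be illustrated with a short sketch that assembles the census list for one aggregation run. The function name and file names are hypothetical, and listing the previous global census file alongside the new per-batch census files is an assumption based on the description above.

```python
from typing import List, Optional


def census_list_for_update(previous_global_census: Optional[str],
                           new_batch_censuses: List[str]) -> str:
    """Return list-file content (one path per line) for a census aggregation run.

    On the first run there is no global census yet, so only per-batch
    census files are listed. On later runs the existing global census
    is listed together with the census files of the new batches.
    """
    paths = []
    if previous_global_census is not None:
        paths.append(previous_global_census)
    paths.extend(new_batch_censuses)
    if not paths:
        raise ValueError("at least one census file is required")
    return "\n".join(paths) + "\n"
```

With this layout, adding a batch never requires revisiting the gVCFs of earlier batches.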

msVCF generation

When a global census file is updated with newly discovered variant sites and/or updated variant statistics at existing variant sites, the user can take the per-batch cohort file, the per-batch census file, and the global census file as input and generate a multisample VCF (msVCF) for one batch of samples (Step 3). The output multisample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. The same statistics computed among the samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multisample VCF is the same across all batches.

To facilitate parallel processing on distributed compute nodes, the user can split the genome into shards of equal size for each step and process each shard with one instance of iterative gVCF Genotyper on each compute node. See the --shard option.

Alternative and unaligned contigs are treated specially when the --shard option is enabled: all contigs that are not autosomes, X, Y, or chrM are placed in the last shard, and no other contigs are assigned to the last shard. The mitochondrial contig is always on its own in the second-to-last shard.
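The contig placement rule can be expressed as a small lookup. This sketch only models the two fixed placements described above; it assumes hg38-style contig names (chr1 to chr22, chrX, chrY, chrM) and at least three shards, and it does not attempt to reproduce how the tool divides the remaining contigs into equal-size shards.

```python
from typing import Optional

# Assumed naming convention for the primary contigs.
PRIMARY_CONTIGS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}
MITO_CONTIG = "chrM"


def fixed_shard_for(contig: str, total_shards: int) -> Optional[int]:
    """Return the 1-based shard index for contigs with a fixed placement.

    Returns None for autosomes, X, and Y, whose shard depends on how the
    genome is split (not modeled here).
    """
    if total_shards < 3:
        raise ValueError("this sketch assumes at least three shards")
    if contig == MITO_CONTIG:
        return total_shards - 1  # alone in the second-to-last shard
    if contig not in PRIMARY_CONTIGS:
        return total_shards      # alternative or unaligned contig: last shard
    return None
```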

If a combined msVCF of all batches is required, run an additional, separate step to merge all of the batch msVCF files into a single msVCF containing all samples.

Command-line arguments common to all steps

--enable-gvcf-genotyper-iterative

Set to true to run the iterative gVCF genotyper (always required).

--ht-reference

The file containing the reference sequence in FASTA format (always required).

--output-directory

The output directory (always required).

--output-file-prefix

The prefix used to label all output files (optional, default value dragen).

--shard

Use this option to process only a portion (shard) of the genome when distributing the work across multiple compute nodes. Provide the 1-based index of the shard to process and the total number of shards, in the format n/N (e.g., 1/50 = shard 1 of 50 shards). To facilitate concurrent processing within each job, the shard is by default split into 10 times as many pieces as there are available threads.
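The n/N format lends itself to generating one value per job. The helpers below are an illustrative sketch with hypothetical names; they produce and validate --shard values and compute the default number of pieces per shard from the thread count.

```python
from typing import List, Tuple


def shard_specs(total_shards: int) -> List[str]:
    """Return every --shard value for a run split into total_shards shards."""
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    return [f"{n}/{total_shards}" for n in range(1, total_shards + 1)]


def parse_shard(spec: str) -> Tuple[int, int]:
    """Parse an 'n/N' string into (n, N), checking that 1 <= n <= N."""
    n_text, total_text = spec.split("/")
    n, total = int(n_text), int(total_text)
    if not 1 <= n <= total:
        raise ValueError(f"invalid shard specification: {spec}")
    return n, total


def default_pieces_per_shard(available_threads: int) -> int:
    """Default subdivision of one shard: 10x the number of available threads."""
    return 10 * available_threads
```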

--gg-regions

If the --shard option is not given, use this option to run iterative gVCF Genotyper only for a subset of regions in the genome. The same regions must be used for each step.

• The value is a comma-delimited list of regions.

• Each region is processed by one thread on the same compute node.
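A value for --gg-regions can be assembled from a list of region names, as in the sketch below. The function name is hypothetical, and the region names in the usage line (whole contigs such as chr1) are an assumption, since the accepted region syntax is not spelled out here.

```python
from typing import List


def gg_regions_value(regions: List[str]) -> str:
    """Join region names into the comma-delimited value for --gg-regions."""
    cleaned = [region.strip() for region in regions]
    if not cleaned:
        raise ValueError("at least one region is required")
    for region in cleaned:
        # A comma inside a region name would be read as a delimiter.
        if not region or "," in region:
            raise ValueError(f"invalid region name: {region!r}")
    return ",".join(cleaned)


# Because the same regions must be used for each step, build the value
# once and reuse it across steps.
regions_value = gg_regions_value(["chr1", "chr2", "chr3"])
```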

--gg-regions-bed

If a path to a BED file is provided, iterative gVCF Genotyper processing is limited to the regions specified in the BED file.

This option differs from --gg-regions in that, if the number of regions exceeds 10 times the number of available threads (for example, with exome data), the regions are not necessarily processed by independent threads, which makes the option faster and compatible with sharding. In this situation, the option only takes effect in Step 1 or end-to-end mode.
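To check in advance whether a BED file crosses the threshold mentioned above, the regions can be counted and compared with 10 times the thread count. This is an illustrative sketch with hypothetical function names; it treats every non-empty line that is not a comment, track, or browser line as one region.

```python
def count_bed_regions(bed_text: str) -> int:
    """Count region records in BED content, skipping header and comment lines."""
    count = 0
    for line in bed_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        count += 1
    return count


def exceeds_thread_threshold(n_regions: int, available_threads: int) -> bool:
    """True when the region count exceeds 10x the number of available threads."""
    return n_regions > 10 * available_threads
```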

--gg-discard-ac-zero

If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.

--gg-remove-nonref

Removes the <NON_REF> symbolic allele from the output of gVCF Genotyper. This option should be used to support downstream tools that cannot process VCF lines with <NON_REF> or to generate more concise msVCFs.

This option needs to be enabled in step 1.

The default is false.

--gg-vc-filter

Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.

--gg-hard-filter

Specifies a filtering expression to be applied to the output msVCF records. See "msVCF hard filtering" below. The default is to apply no filters.

--gg-skip-filtered-sites

Omits msVCF records that fail the given hard filter. The default is false.

Command-line arguments for gVCF aggregation (Step 1)

--gvcfs-to-cohort-census	Set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list	The path to a file containing a list of input gVCF files, with the path to each file on a separate line.
--variant	If --variant-list is not given, use this option for each input gVCF file.
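A Step 1 invocation can be assembled from the options above together with the common options. The sketch below builds the argument list without running anything. The executable name dragen, the space-separated "--option value" style, and all paths are assumptions for illustration.

```python
from typing import List, Optional


def step1_command(reference: str, output_dir: str, prefix: str,
                  variant_list: str, shard: Optional[str] = None) -> List[str]:
    """Build the argument list for gVCF aggregation on one batch."""
    argv = [
        "dragen",  # assumed executable name
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
        "--gvcfs-to-cohort-census", "true",
        "--variant-list", variant_list,
    ]
    if shard is not None:
        argv += ["--shard", shard]  # for example "1/50"
    return argv


# Hypothetical paths for one batch processed as shard 1 of 50.
example = step1_command("ref.fa", "out/batch1", "batch1",
                        "batch1.gvcfs.txt", shard="1/50")
```

The returned list can be passed to subprocess.run or joined with shlex.join for logging.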

Command-line arguments for census aggregation (Step 2)

--aggregate-censuses	Set to true to aggregate a list of per-batch census files into a global census file.
--input-census-list	The path to a file containing a list of input per-batch census files (from Step 1), with the path to each file on a separate line.
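Census aggregation follows the same pattern. This sketch assumes the executable name dragen and the "--option value" style; the paths are placeholders.

```python
from typing import List


def step2_command(reference: str, output_dir: str, prefix: str,
                  census_list: str) -> List[str]:
    """Build the argument list for aggregating census files into a global census."""
    return [
        "dragen",  # assumed executable name
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
        "--aggregate-censuses", "true",
        "--input-census-list", census_list,
    ]


# Hypothetical paths.
example = step2_command("ref.fa", "out/global", "global", "censuses.txt")
```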

Command-line arguments for msVCF generation (Step 3)

--generate-msvcf	Set to true to generate a multisample VCF for one batch of samples.
--input-cohort-file	The path to the per-batch cohort file (from Step 1).
--input-census-file	The path to the per-batch census file (from Step 1).
--input-global-census-file	The path to the global census file (from Step 2).
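Because msVCF generation is repeated once per batch against the same global census file, it is convenient to generate one command per batch. The sketch below assumes the executable name dragen and the "--option value" style; the batch names, file names, and the choice to use the batch name as the output prefix are hypothetical.

```python
from typing import Dict, List, Tuple


def step3_commands(reference: str, output_dir: str, global_census: str,
                   batches: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build one msVCF generation command per batch.

    `batches` maps a batch name to its (cohort_file, census_file) pair
    produced in Step 1.
    """
    commands = {}
    for name, (cohort_file, census_file) in batches.items():
        commands[name] = [
            "dragen",  # assumed executable name
            "--enable-gvcf-genotyper-iterative", "true",
            "--ht-reference", reference,
            "--output-directory", output_dir,
            "--output-file-prefix", name,
            "--generate-msvcf", "true",
            "--input-cohort-file", cohort_file,
            "--input-census-file", census_file,
            "--input-global-census-file", global_census,
        ]
    return commands
```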

Command-line arguments for end-to-end mode

--variant-list	The path to a file containing a list of input gVCF files, with the path to each file on a separate line.
--variant	If --variant-list is not given, use this option for each input gVCF file.

Command-line arguments for merging batch msVCFs

--merge-batches	Set to true to merge msVCF files for a set of batches.
--input-batch-list	The path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file and all batches pertaining to that global census must be included in the merge.
--gg-enable-indexing	Set to true to generate a tabix index for the merged msVCF (default false).
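The merge step can be sketched in the same way. As before, the executable name dragen, the "--option value" style, and the paths are assumptions; the completeness check is a hypothetical helper that reflects the requirement that every batch belonging to the global census is included.

```python
from typing import Iterable, List


def check_all_batches_listed(expected_batches: Iterable[str],
                             listed_batches: Iterable[str]) -> None:
    """Raise if a batch belonging to the global census is missing from the merge."""
    missing = sorted(set(expected_batches) - set(listed_batches))
    if missing:
        raise ValueError(f"batches missing from merge: {missing}")


def merge_command(reference: str, output_dir: str, prefix: str,
                  batch_list: str, enable_indexing: bool = False) -> List[str]:
    """Build the argument list for merging per-batch msVCF files."""
    argv = [
        "dragen",  # assumed executable name
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
        "--merge-batches", "true",
        "--input-batch-list", batch_list,
    ]
    if enable_indexing:
        argv += ["--gg-enable-indexing", "true"]
    return argv
```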

Metric	Description	Scope	Number of values
HWE	Hardy-Weinberg Equilibrium P-value	Allele-wise	One for each alt allele
ExcHet	Excess Heterozygosity P-value	Allele-wise	One for each alt allele
HWEc2	Hardy-Weinberg Equilibrium P-value	Site-wise	1
IC	Inbreeding Coefficient	Site-wise	1
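For orientation, the textbook form of the inbreeding coefficient at a biallelic site compares the observed number of heterozygotes with the number expected under Hardy-Weinberg equilibrium. The sketch below implements that general definition only; it is not taken from the tool, and the exact formula used for the IC metric may differ (for example, in how multiallelic sites or missing genotypes are handled).

```python
from typing import Optional


def inbreeding_coefficient(n_hom_ref: int, n_het: int, n_hom_alt: int) -> Optional[float]:
    """Textbook inbreeding coefficient F = 1 - observed_het / expected_het.

    Returns None for a monomorphic site, where the expected number of
    heterozygotes is zero and F is undefined.
    """
    n_samples = n_hom_ref + n_het + n_hom_alt
    if n_samples == 0:
        raise ValueError("at least one genotyped sample is required")
    # Reference allele frequency from genotype counts.
    p = (2 * n_hom_ref + n_het) / (2 * n_samples)
    q = 1.0 - p
    expected_het = 2.0 * p * q * n_samples
    if expected_het == 0:
        return None
    return 1.0 - n_het / expected_het
```

Values near 0 indicate genotype counts consistent with Hardy-Weinberg proportions, positive values indicate a deficit of heterozygotes, and negative values indicate an excess.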