Iterative gVCF Genotyper Analysis

Iterative gVCF Genotyper Analysis offers an iterative workflow for aggregating new samples into an existing cohort. It allows users to incrementally aggregate new batches of samples with previously processed batches, without redoing the analysis from scratch across all samples every time new samples become available. The workflow takes per-sample gVCF files as input and can be run in step-by-step mode, when multiple batches of samples are available, or in end-to-end mode, when only a single batch is available.

Workflow steps

1. gVCF aggregation

Users can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format that stores the gVCF data of multiple samples, similar to a multisample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks across the samples in the cohort. When a large number of samples is available, users can divide the samples into multiple batches of similar size (e.g., 1,000 samples each) and repeat Step 1 for every batch.
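The following Python sketch illustrates the idea of a census only; it does not reflect the tool's actual file formats, and the record layout and field names (ac, an, n_genotyped) are simplified, hypothetical stand-ins for the per-site summary statistics:

```python
from collections import defaultdict

def build_batch_census(per_sample_records):
    """Condense a batch of per-sample variant calls into per-site summary counts.

    per_sample_records: one list per sample of (chrom, pos, ref, alt, genotype)
    tuples, where genotype is a string such as "0/1". Hom-ref blocks and
    quality fields present in a real gVCF are omitted for brevity.
    """
    census = defaultdict(lambda: {"ac": 0, "an": 0, "n_genotyped": 0})
    for sample_records in per_sample_records:
        for chrom, pos, ref, alt, genotype in sample_records:
            site = census[(chrom, pos, ref, alt)]
            site["ac"] += genotype.count("1")   # copies of the alternate allele
            site["an"] += 2                     # two alleles observed (diploid)
            site["n_genotyped"] += 1            # samples genotyped at this site
    return dict(census)

# Example: a batch of two samples, one shared site and one singleton.
batch_census = build_batch_census([
    [("chr1", 10_001, "A", "G", "0/1")],
    [("chr1", 10_001, "A", "G", "1/1"), ("chr1", 20_500, "C", "T", "0/1")],
])
```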

2. Census aggregation

After all per-batch census files have been generated, users can aggregate them into a single global census file. This step scales to thousands of batches and is much more efficient than aggregating the gVCFs from all batches. When a new batch of samples becomes available, users only need to perform Step 1 on that batch, aggregate its census file with the global census file from all previous batches, and generate an updated global census file.
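This step is cheap because it operates on summary counts rather than raw gVCF data. A minimal sketch of the merge, using the same hypothetical census layout as above:

```python
def merge_census(global_census, batch_census):
    """Fold one per-batch census into the running global census by summing counts."""
    for key, stats in batch_census.items():
        site = global_census.setdefault(key, {"ac": 0, "an": 0, "n_genotyped": 0})
        for field in ("ac", "an", "n_genotyped"):
            site[field] += stats[field]
    return global_census

# Two toy per-batch censuses (same shape as produced in the Step 1 sketch).
batch_1 = {("chr1", 10_001, "A", "G"): {"ac": 3, "an": 4, "n_genotyped": 2}}
batch_2 = {("chr1", 10_001, "A", "G"): {"ac": 1, "an": 2, "n_genotyped": 1},
           ("chr1", 20_500, "C", "T"): {"ac": 1, "an": 2, "n_genotyped": 1}}

# Only the compact censuses are touched; the per-sample gVCFs of earlier
# batches never need to be reprocessed when a new batch arrives.
global_census = {}
for batch in (batch_1, batch_2):
    global_census = merge_census(global_census, batch)
```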

3. msVCF generation

When the global census file has been updated, with new variant sites discovered and/or variant statistics updated at existing sites, the user can take the per-batch cohort file, the per-batch census file, and the global census file as input and generate a multisample VCF for one batch of samples. The output multisample VCF contains the variants and alleles discovered across all samples from all batches, along with global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. The same statistics computed over the samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multisample VCF is the same across all batches.
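As a rough illustration of how the global and in-batch statistics relate, the sketch below derives per-site values from the hypothetical census layout used above. The field names (global_AF, batch_AF, and so on) are illustrative, not the tool's actual msVCF annotations, and samples not genotyped at a site are treated here as having no coverage:

```python
def msvcf_site_stats(site_key, global_census, batch_census,
                     total_samples, batch_samples):
    """Derive the kind of per-site statistics an output msVCF record carries."""
    empty = {"ac": 0, "an": 0, "n_genotyped": 0}
    g = global_census.get(site_key, empty)
    b = batch_census.get(site_key, empty)
    return {
        # Cohort-wide statistics, identical in every batch's msVCF.
        "global_AF": g["ac"] / g["an"] if g["an"] else 0.0,
        "global_samples_with_genotype": g["n_genotyped"],
        "global_samples_without_coverage": total_samples - g["n_genotyped"],
        # The same statistics restricted to the samples of this batch.
        "batch_AF": b["ac"] / b["an"] if b["an"] else 0.0,
        "batch_samples_with_genotype": b["n_genotyped"],
        "batch_samples_without_coverage": batch_samples - b["n_genotyped"],
    }

# Every site in the global census is emitted for every batch, which is why
# each batch's msVCF contains the same number of records.
global_census = {("chr1", 10_001, "A", "G"): {"ac": 4, "an": 6, "n_genotyped": 3}}
batch_census  = {("chr1", 10_001, "A", "G"): {"ac": 1, "an": 2, "n_genotyped": 1}}
stats = msvcf_site_stats(("chr1", 10_001, "A", "G"),
                         global_census, batch_census,
                         total_samples=3, batch_samples=1)
```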

To facilitate parallel processing on distributed compute nodes, the user can choose to split the genome into shards of equal size for each step and process each shard with one instance of iterative gVCF Genotyper on each compute node. See the --shard option.
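A minimal sketch of this sharding pattern is shown below; the shard computation and the process_shard placeholder are hypothetical, and in a real deployment each shard would instead be handled by a separate iterative gVCF Genotyper instance on its own node:

```python
from concurrent.futures import ProcessPoolExecutor

def make_shards(chrom_lengths, shard_size):
    """Split the genome into fixed-size intervals; each interval is one work unit."""
    return [(chrom, start, min(start + shard_size, length))
            for chrom, length in chrom_lengths.items()
            for start in range(0, length, shard_size)]

def process_shard(shard):
    """Placeholder for the per-shard work.

    In practice each shard would be processed by one iterative gVCF Genotyper
    instance (selected via its --shard option) on its own compute node.
    """
    chrom, start, end = shard
    return f"{chrom}:{start}-{end} processed"

if __name__ == "__main__":
    shards = make_shards({"chr1": 248_956_422, "chr2": 242_193_529},
                         shard_size=50_000_000)
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(process_shard, shards)))
```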