Iterative gVCF Genotyper Analysis

Iterative gVCF Genotyper Analysis offers an iterative workflow for aggregating new samples into an existing cohort. It allows users to incrementally aggregate new batches of samples with previously processed batches, without redoing the analysis from scratch across all samples every time new samples become available. The workflow takes per-sample gVCF files as input and can be run in step-by-step mode, when multiple batches of samples are available, or in end-to-end mode, when only a single batch is available.

Workflow steps

1. gVCF aggregation

Users can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format that stores the gVCF data of multiple samples, similar to a multisample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks across the samples in the cohort. When a large number of samples is available, users can divide the samples into multiple batches of similar size (e.g., 1,000 samples each) and repeat Step 1 for every batch.
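The following Python sketch illustrates the idea of a census only; it does not reflect the tool's actual file formats, and the record layout and field names (ac, an, n_genotyped) are simplified, hypothetical stand-ins for the per-site summary statistics:

```python
from collections import defaultdict

def build_batch_census(per_sample_records):
    """Condense a batch of per-sample variant calls into per-site summary counts.

    per_sample_records: one list per sample of (chrom, pos, ref, alt, genotype)
    tuples, where genotype is a string such as "0/1". Hom-ref blocks and
    quality fields present in a real gVCF are omitted for brevity.
    """
    census = defaultdict(lambda: {"ac": 0, "an": 0, "n_genotyped": 0})
    for sample_records in per_sample_records:
        for chrom, pos, ref, alt, genotype in sample_records:
            site = census[(chrom, pos, ref, alt)]
            site["ac"] += genotype.count("1")   # copies of the alternate allele
            site["an"] += 2                     # two alleles observed (diploid)
            site["n_genotyped"] += 1            # samples genotyped at this site
    return dict(census)

# Example: a batch of two samples, one shared site and one singleton.
batch_census = build_batch_census([
    [("chr1", 10_001, "A", "G", "0/1")],
    [("chr1", 10_001, "A", "G", "1/1"), ("chr1", 20_500, "C", "T", "0/1")],
])
```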

2. Census aggregation

After all per-batch census files have been generated, users can aggregate them into a single global census file. This step scales to thousands of batches and is much more efficient than aggregating the gVCFs from all batches. When a new batch of samples becomes available, users only need to perform Step 1 on that batch, aggregate its census file with the global census file from all previous batches, and generate an updated global census file.
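This step is cheap because it operates on summary counts rather than raw gVCF data. A minimal sketch of the merge, using the same hypothetical census layout as above:

```python
def merge_census(global_census, batch_census):
    """Fold one per-batch census into the running global census by summing counts."""
    for key, stats in batch_census.items():
        site = global_census.setdefault(key, {"ac": 0, "an": 0, "n_genotyped": 0})
        for field in ("ac", "an", "n_genotyped"):
            site[field] += stats[field]
    return global_census

# Two toy per-batch censuses (same shape as produced in the Step 1 sketch).
batch_1 = {("chr1", 10_001, "A", "G"): {"ac": 3, "an": 4, "n_genotyped": 2}}
batch_2 = {("chr1", 10_001, "A", "G"): {"ac": 1, "an": 2, "n_genotyped": 1},
           ("chr1", 20_500, "C", "T"): {"ac": 1, "an": 2, "n_genotyped": 1}}

# Only the compact censuses are touched; the per-sample gVCFs of earlier
# batches never need to be reprocessed when a new batch arrives.
global_census = {}
for batch in (batch_1, batch_2):
    global_census = merge_census(global_census, batch)
```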

3. msVCF generation

When the global census file has been updated, with new variant sites discovered and/or variant statistics updated at existing sites, the user can take the per-batch cohort file, the per-batch census file, and the global census file as input and generate a multisample VCF for one batch of samples. The output multisample VCF contains the variants and alleles discovered across all samples from all batches, along with global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. The same statistics computed over the samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multisample VCF is the same across all batches.
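As a rough illustration of how the global and in-batch statistics relate, the sketch below derives per-site values from the hypothetical census layout used above. The field names (global_AF, batch_AF, and so on) are illustrative, not the tool's actual msVCF annotations, and samples not genotyped at a site are treated here as having no coverage:

```python
def msvcf_site_stats(site_key, global_census, batch_census,
                     total_samples, batch_samples):
    """Derive the kind of per-site statistics an output msVCF record carries."""
    empty = {"ac": 0, "an": 0, "n_genotyped": 0}
    g = global_census.get(site_key, empty)
    b = batch_census.get(site_key, empty)
    return {
        # Cohort-wide statistics, identical in every batch's msVCF.
        "global_AF": g["ac"] / g["an"] if g["an"] else 0.0,
        "global_samples_with_genotype": g["n_genotyped"],
        "global_samples_without_coverage": total_samples - g["n_genotyped"],
        # The same statistics restricted to the samples of this batch.
        "batch_AF": b["ac"] / b["an"] if b["an"] else 0.0,
        "batch_samples_with_genotype": b["n_genotyped"],
        "batch_samples_without_coverage": batch_samples - b["n_genotyped"],
    }

# Every site in the global census is emitted for every batch, which is why
# each batch's msVCF contains the same number of records.
global_census = {("chr1", 10_001, "A", "G"): {"ac": 4, "an": 6, "n_genotyped": 3}}
batch_census  = {("chr1", 10_001, "A", "G"): {"ac": 1, "an": 2, "n_genotyped": 1}}
stats = msvcf_site_stats(("chr1", 10_001, "A", "G"),
                         global_census, batch_census,
                         total_samples=3, batch_samples=1)
```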

To facilitate parallel processing on distributed compute nodes, the user can choose to split the genome into shards of equal size for each step and process each shard with one instance of iterative gVCF Genotyper on each compute node. See the --shard option.
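A minimal sketch of this sharding pattern is shown below; the shard computation and the process_shard placeholder are hypothetical, and in a real deployment each shard would instead be handled by a separate iterative gVCF Genotyper instance on its own node:

```python
from concurrent.futures import ProcessPoolExecutor

def make_shards(chrom_lengths, shard_size):
    """Split the genome into fixed-size intervals; each interval is one work unit."""
    return [(chrom, start, min(start + shard_size, length))
            for chrom, length in chrom_lengths.items()
            for start in range(0, length, shard_size)]

def process_shard(shard):
    """Placeholder for the per-shard work.

    In practice each shard would be processed by one iterative gVCF Genotyper
    instance (selected via its --shard option) on its own compute node.
    """
    chrom, start, end = shard
    return f"{chrom}:{start}-{end} processed"

if __name__ == "__main__":
    shards = make_shards({"chr1": 248_956_422, "chr2": 242_193_529},
                         shard_size=50_000_000)
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(process_shard, shards)))
```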