Iterative gVCF Genotyper Analysis
Iterative gVCF Genotyper Analysis offers an iterative workflow to aggregate new samples into an existing cohort. The workflow lets users incrementally aggregate new batches of samples with existing batches, without redoing the analysis from scratch across all samples each time new samples become available. It requires per-sample gVCF files as input and can be run in step-by-step mode when multiple batches of samples are available, or in end-to-end mode when only a single batch is available.
Workflow steps
1. gVCF aggregation
Users can use the iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format that stores gVCF data from multiple samples, similar to a multisample gVCF. The census file stores summary statistics of all variants and hom-ref blocks among the samples in the cohort. When a large number of samples is available, users can divide them into batches of similar size (e.g., 1000 samples each) and repeat Step 1 for every batch.
2. Census aggregation
After all per-batch census files are generated, users can aggregate them into a single global census file. This step scales to thousands of batches and is much more efficient than aggregating the gVCFs from all batches. When a new batch of samples becomes available, users only need to perform Step 1 on that batch, then aggregate its census file with the global census file from all previous batches to generate an updated global census file.
3. msVCF generation
When a global census file is updated with newly discovered variant sites and/or updated variant statistics at existing variant sites, the user can take the per-batch cohort file, the per-batch census file, and the global census file as input and generate a multisample VCF for one batch of samples. The output multisample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among the samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multisample VCF is the same across all batches.
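As a rough illustration of the incremental logic in the steps above, the following Python sketch lists which steps would be run when one new batch joins an existing cohort. The function name, step labels, and batch names are hypothetical and exist only for this example. It also assumes that up-to-date msVCFs are wanted for every batch, which the workflow permits but does not require.

```python
def plan_incremental_update(existing_batches, new_batch):
    """List the (step, target) pairs to run when `new_batch` is added."""
    plan = [
        # Step 1 runs only on the new batch; earlier batches keep their
        # existing cohort and census files.
        ("gvcf_aggregation", new_batch),
        # Step 2 combines the new per-batch census with the previous
        # global census to produce an updated global census.
        ("census_aggregation", "global"),
    ]
    # Step 3 can be repeated per batch so that each msVCF reflects the
    # updated global census (assumption: all batches are refreshed).
    for batch in list(existing_batches) + [new_batch]:
        plan.append(("msvcf_generation", batch))
    return plan


if __name__ == "__main__":
    for step, target in plan_incremental_update(["batch1", "batch2"], "batch3"):
        print(step, target)
```

The point of the sketch is that the expensive gVCF aggregation is never repeated for batches that have already been processed.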
To facilitate parallel processing on distributed compute nodes, the user can split the genome into shards of equal size for each step and process each shard with one instance of iterative gVCF Genotyper on its own compute node. See the --shard option.
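The sketch below shows one way to enumerate the shard arguments for a distributed run, using the n/N format (1-based index, total count) described for the --shard option, together with the default within-shard split of 10x the available threads. The helper names are hypothetical and are not part of DRAGEN.

```python
def shard_specs(total_shards):
    """Return the n/N strings for every shard, using 1-based indices."""
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    return [f"{n}/{total_shards}" for n in range(1, total_shards + 1)]


def default_split_count(available_threads):
    """Default number of parts a shard is split into (10x the threads)."""
    return 10 * available_threads


if __name__ == "__main__":
    # One value per compute node, each passed to --shard.
    print(shard_specs(4))
    print(default_split_count(16))
```

Each string returned by `shard_specs` would be passed to one instance of the genotyper, with the same total used for every step.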
| --enable-gvcf-genotyper-iterative | Set to true to run the iterative gVCF genotyper (always required). |
| --ht-reference | The file containing the reference sequence in FASTA format (always required). |
| --output-directory | The output directory (always required). |
| --output-file-prefix | The prefix used to label all output files (optional, default value dragen). |
| --shard | Use this option to process only a portion (shard) of the genome when distributing the work across multiple compute nodes. Provide the 1-based index of the shard to process and the total number of shards, in the format n/N (e.g., 1/50 = shard 1 of 50 shards). To facilitate concurrent processing within each job, the shard is split by default into 10 times as many parts as there are available threads. |
| --gg-regions | If the --shard option is not given, use this option to run iterative gVCF genotyper only for a subset of regions in the genome. The same regions must be used for each step. |
| --gg-discard-ac-zero | If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true. |
| --gg-remove-nonref | Removes the <NON_REF> symbolic allele from the output of gVCF Genotyper. Use this option to support downstream tools that cannot process VCF lines with <NON_REF>, or to generate more concise msVCFs. Enabling this option forces the use of --gg-discard-ac-zero. The default is false. |
| --gvcfs-to-cohort-census | Set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file. |
| --variant-list | The path to a file containing a list of input gVCF files, with the path to each file on a separate line. |
| --variant | If --variant-list is not given, use this option for each input gVCF file. |
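To show how the gVCF aggregation options fit together, here is a minimal Python sketch that assembles a Step 1 command line as an argument list. It uses only options documented above. The executable name `dragen`, the space-separated `--option value` form, and all file paths are assumptions made for illustration.

```python
def variant_list_text(gvcf_paths):
    """One gVCF path per line, the layout described for --variant-list."""
    return "".join(f"{path}\n" for path in gvcf_paths)


def step1_command(reference, out_dir, variant_list, prefix=None, shard=None,
                  exe="dragen"):
    """Assemble the gVCF aggregation command as an argv list."""
    cmd = [
        exe,  # assumed executable name
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", out_dir,
        "--gvcfs-to-cohort-census", "true",
        "--variant-list", variant_list,
    ]
    if prefix is not None:
        cmd += ["--output-file-prefix", prefix]
    if shard is not None:
        cmd += ["--shard", shard]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; the list file must be written before the run.
    print(variant_list_text(["s1.gvcf.gz", "s2.gvcf.gz"]), end="")
    print(" ".join(step1_command("ref.fa", "out/batch1", "batch1.list",
                                 prefix="batch1", shard="1/50")))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to a process launcher.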
| --aggregate-censuses | Set to true to aggregate a list of per-batch census files into a global census file. |
| --input-census-list | The path to a file containing a list of input per-batch census files (from Step 1), with the path to each file on a separate line. |
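The census aggregation step can be sketched in the same way. The Python example below builds the census list and the Step 2 command from the options documented above. It assumes that, for an incremental update, the previous global census is listed in the same file as the new per-batch census; the workflow description implies this but does not state the mechanism. The executable name `dragen` and all paths are likewise hypothetical.

```python
def census_list_text(batch_censuses, previous_global_census=None):
    """One census path per line, the layout described for --input-census-list."""
    paths = list(batch_censuses)
    if previous_global_census is not None:
        # Assumption: the earlier global census is aggregated like any
        # other census file in the list.
        paths.append(previous_global_census)
    return "".join(f"{path}\n" for path in paths)


def step2_command(reference, out_dir, census_list, shard=None, exe="dragen"):
    """Assemble the census aggregation command as an argv list."""
    cmd = [
        exe,  # assumed executable name
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", out_dir,
        "--aggregate-censuses", "true",
        "--input-census-list", census_list,
    ]
    if shard is not None:
        cmd += ["--shard", shard]
    return cmd


if __name__ == "__main__":
    print(census_list_text(["batch3.census"], "global_v2.census"), end="")
    print(" ".join(step2_command("ref.fa", "out/global", "census.list")))
```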
| --generate-msvcf | Set to true to generate a multisample VCF for one batch of samples. Enabled by default if options are not selected for Steps 1, 2, or 3. |
| --input-cohort-file | The path to the per-batch cohort file (from Step 1). |
| --input-census-file | The path to the per-batch census file (from Step 1). |
| --input-global-census-file | The path to the global census file (from Step 2). |
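The msVCF generation step takes three inputs per batch. The following Python sketch assembles that command from the options documented above. As before, the executable name `dragen` and every file path are assumptions for illustration, and the actual names of the cohort and census files produced by earlier steps are not specified here.

```python
def step3_command(reference, out_dir, cohort_file, census_file,
                  global_census_file, shard=None, exe="dragen"):
    """Assemble the msVCF generation command for one batch as an argv list."""
    cmd = [
        exe,  # assumed executable name
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", out_dir,
        "--generate-msvcf", "true",
        "--input-cohort-file", cohort_file,          # from Step 1
        "--input-census-file", census_file,          # from Step 1
        "--input-global-census-file", global_census_file,  # from Step 2
    ]
    if shard is not None:
        cmd += ["--shard", shard]
    return cmd


if __name__ == "__main__":
    # Hypothetical per-batch inputs; the global census is shared by all.
    batches = {
        "batch1": ("batch1.cohort", "batch1.census"),
        "batch2": ("batch2.cohort", "batch2.census"),
    }
    for name, (cohort, census) in batches.items():
        print(" ".join(step3_command("ref.fa", f"out/{name}", cohort, census,
                                     "global.census")))
```

Passing the same global census file to every batch is what keeps the record count identical across the resulting msVCFs.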
| --variant-list | The path to a file containing a list of input gVCF files, with the path to each file on a separate line. |
| --variant | If --variant-list is not given, use this option for each input gVCF file. |
| --merge-batches | Set to true to merge msVCF files for a set of batches. |
| --input-batch-list | The path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, and all batches pertaining to that global census must be included in the merge. |
| --gg-enable-indexing | Set to true to generate a tabix index for the merged msVCF (default false). |
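The merge options above come with a completeness requirement: every batch belonging to the global census must appear in the list. The Python sketch below checks that requirement before assembling the merge command. The helper names, the executable name `dragen`, and the file paths are hypothetical; only the option names come from the table above.

```python
def check_all_batches_listed(expected_msvcfs, listed_msvcfs):
    """Raise if any msVCF of the global census is missing from the merge."""
    missing = sorted(set(expected_msvcfs) - set(listed_msvcfs))
    if missing:
        raise ValueError(f"msVCFs missing from merge list: {missing}")


def merge_command(reference, out_dir, batch_list, index=False, exe="dragen"):
    """Assemble the batch merge command as an argv list."""
    cmd = [
        exe,  # assumed executable name
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", out_dir,
        "--merge-batches", "true",
        "--input-batch-list", batch_list,
    ]
    if index:
        # Requests a tabix index for the merged msVCF.
        cmd += ["--gg-enable-indexing", "true"]
    return cmd


if __name__ == "__main__":
    msvcfs = ["batch1.vcf.gz", "batch2.vcf.gz"]
    check_all_batches_listed(msvcfs, msvcfs)
    print(" ".join(merge_command("ref.fa", "out/merged", "batches.list",
                                 index=True)))
```

The check cannot confirm that the files were generated from the same global census file; that still has to be tracked by the user.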
Mimalloc is a custom memory allocation library that can yield significant speed-ups in the iterative gVCF Genotyper workflow. In some deployments (e.g., cloud), it is used automatically and seamlessly. In other contexts the user must activate it explicitly, because at present it cannot be included in standard DRAGEN by default.
The convenience script mi_dragen.sh is provided for this purpose. It loads the bundled library and can be used transparently in the same way as the DRAGEN executable. It is intended and supported only for use with the iterative gVCF Genotyper component, although in principle it can be applied to any other DRAGEN workflow.
Using Mimalloc for other purposes is known to potentially cause excessive memory use. It is not recommended and is at your own risk.
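Because mi_dragen.sh is used in the same way as the DRAGEN executable, switching to it amounts to replacing the first element of the command line. The Python sketch below does that and refuses commands that do not enable the iterative gVCF Genotyper, reflecting the supported scope described above. The function name is hypothetical, and the script is assumed to be reachable under the plain name `mi_dragen.sh`; its actual location depends on the installation.

```python
ITERATIVE_FLAG = "--enable-gvcf-genotyper-iterative"


def use_mimalloc(cmd, script="mi_dragen.sh"):
    """Return a copy of `cmd` that runs through the Mimalloc wrapper script."""
    if ITERATIVE_FLAG not in cmd:
        raise ValueError("Mimalloc wrapper is supported only for the "
                         "iterative gVCF Genotyper")
    pos = cmd.index(ITERATIVE_FLAG)
    # The flag must be followed by the value "true".
    if cmd[pos + 1:pos + 2] != ["true"]:
        raise ValueError("iterative gVCF Genotyper is not enabled")
    # Keep every argument, swap only the executable (first element).
    return [script] + list(cmd[1:])


if __name__ == "__main__":
    original = ["dragen", ITERATIVE_FLAG, "true", "--aggregate-censuses", "true"]
    print(" ".join(use_mimalloc(original)))
```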