Joint Genotyper Options

This section provides information on the options available for each of the genotypers.

gVCF Genotyper Options

The gVCF Genotyper takes a set of single-sample gVCFs and outputs a multisample VCF that contains one entry per variant seen in any of the input gVCFs. It does not adjust genotypes using population information.

gVCF Genotyper can also read gVCF files from an S3 bucket. For gVCF files in a public bucket, you can use URLs with the prefix s3:// or https:// in --variant or --variant-list. If the bucket requires authentication, you can use environment variables or config files. See the Samtools website for information on the htslib AWS S3 plugin.

gVCF Genotyper needs access to the index file for each gVCF input. For each input, join the gVCF URL and the index URL with the separator ##idx##, for example https://url1.gvcf.gz##idx##https://url2.gvcf.gz.tbi, and pass the combined string to --variant or --variant-list on the command line.
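The combined form can be assembled with a small helper. The following is a minimal Python sketch, not part of the product. The function names and the example host are hypothetical, and the fallback that appends .tbi to the gVCF URL when no index URL is given is an assumption about where the index is stored.

```python
from typing import Iterable, Optional, Tuple

# Separator between the gVCF URL and its index URL, as described above.
IDX_DELIMITER = "##idx##"


def combine_url_and_index(gvcf_url: str, index_url: Optional[str] = None) -> str:
    """Return '<gvcf_url>##idx##<index_url>'.

    Assumption: if no index URL is given, the index is the gVCF URL
    with '.tbi' appended. Adjust this if your index lives elsewhere.
    """
    if index_url is None:
        index_url = gvcf_url + ".tbi"
    return f"{gvcf_url}{IDX_DELIMITER}{index_url}"


def variant_list_text(pairs: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Build the text of a list file: one combined entry per line."""
    return "".join(combine_url_and_index(g, i) + "\n" for g, i in pairs)
```

The text returned by variant_list_text can be saved to a file and that file passed to --variant-list, while a single combined string can be passed directly to --variant.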

The following parameters are available for gVCF Genotyper.

Option	Description
--enable-gvcf-genotyper	To enable the gVCF Genotyper, set to true.
--ht-reference	The file containing the reference sequence in FASTA format. --ht-reference is required.
--output-directory	The output directory. --output-directory is required.
--output-file-prefix	The prefix used to label all output files. --output-file-prefix is required.
--gg-output-format	The output file format. The permitted formats are vcf.gz, vcf, and bcf. The default value is vcf.gz. Only the vcf.gz format is compatible with the Joint Genotyper. If you use a different format, you can convert it using the open-source bcftools utility.
--gg-regions	The file that specifies the regions in which to run the gVCF Genotyper. Variants outside these regions are ignored. The file can be either a BED file or a list of genomic regions in the form chromosome:start-end, separated by commas or line breaks. For exome or enrichment data, specify the regions targeted by the probes to avoid spending additional time on unreliable variants that lie outside the targeted regions.
--gg-enable-concat	Concatenates the output for genomic regions into a single output file. The default value is true.
--gg-max-alternate-alleles	Maximum number of alternate alleles. By default, the value is set to 50. If there are more alleles than the set limit, alleles are ranked by frequency of occurrence in the input samples. The most common alleles are output.
--num-threads	The number of processor threads to use. The default is the number of cores available.
--gg-sites-list	Forces gVCF Genotyper to output depth information at each site in the file. The file format is bed or bed.gz. By default, force-genotyped sites that do not occur in any sample are not printed. To display these sites, set --gg-discard-ac-zero to false.
--gg-spvcf-out	Write output in sparse project VCF format. Set to true to enable. The option is disabled by default. For more information on sparse project VCF format, see Sparse Project VCF: efficient encoding of population genotype matrices.¹
--gg-enable-indexing	Build a tabix index for the output file. The option is enabled by default. The --gg-output-format must be set to vcf.gz to use --gg-enable-indexing.
--gg-drop-genotypes	Select to output only the alleles for each variant. By default, the value is set to false. --gg-drop-genotypes is equivalent to running bcftools view -G on the default output.
--gg-write-phased-gt	Controls whether phasing information is written to the output. If set to false, the gVCF Genotyper ignores phasing information in the input files and writes phased genotypes as unphased in the output file. This option is enabled by default.
--gg-allele-list	[Optional] Forces the output of genotypes at specified sites. Specify the path to a vcf.gz or bcf file that contains the sites.
--gg-remove-nonref	[Optional] Removes the <NON_REF> symbolic allele from the output of gVCF Genotyper. Use this option to support downstream tools that cannot process VCF lines with <NON_REF>.
--gg-sample-rename-mapfile	[Optional] Specify the path to a file that contains a tab- or comma-separated mapping of samples to be renamed in the merged output. Include the original sample, target name, and file name or use /Perl regex pattern/,substitution. The following is an example file: sampleX_3466,sampleX_3456 sampleY_1234,sampleY_filtered_1234,sampleY_30x_filtered.gvcf.gz /(\w+)_badsuffix/ \1_bettersuffix
--gg-concurrency-regions	[Optional] Specify the regions to process in parallel. You can specify a BED file, a comma-separated list of regions, or a line-separated list of regions.
--gg-discard-ac-zero	If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
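
The region-list form accepted by --gg-regions (chromosome:start-end entries separated by commas or line breaks) can be checked before a run. The following is a minimal Python sketch under stated assumptions: the function name is hypothetical, the start <= end check is an added sanity check rather than documented behavior, and the sketch does not interpret whether coordinates are 0-based or 1-based.

```python
import re
from typing import List, Tuple

# chromosome:start-end, for example chr1:1000-2000.
# The greedy \S+ lets contig names that contain ':' still parse.
_REGION_RE = re.compile(r"^(\S+):(\d+)-(\d+)$")


def parse_regions(text: str) -> List[Tuple[str, int, int]]:
    """Parse regions separated by commas or line breaks into tuples."""
    regions = []
    for token in re.split(r"[,\n]", text):
        token = token.strip()
        if not token:
            continue  # skip blank lines and trailing separators
        match = _REGION_RE.match(token)
        if match is None:
            raise ValueError(f"not in chromosome:start-end form: {token!r}")
        chrom = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if start > end:
            # Assumption: a region whose start exceeds its end is a typo.
            raise ValueError(f"start is greater than end: {token!r}")
        regions.append((chrom, start, end))
    return regions
```

For example, parse_regions("chr1:100-200,chr2:5-10") returns one (chromosome, start, end) tuple per region, and a malformed entry raises ValueError before any processing time is spent.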

¹ Lin MF, Bai X, Salerno WJ, Reid JG. Sparse Project VCF: efficient encoding of population genotype matrices. Bioinformatics. 2020;36(22-23):5537-5538. doi:10.1093/bioinformatics/btaa1004

Joint Genotyper Options

You can run the Joint Genotyper from a multisample VCF, a multisample gVCF, or directly from a set of single sample gVCFs.

The following parameters are available for Joint Genotyper.

Option	Description
--enable-joint-genotyping	To run the Joint Genotyper, set to true.
--output-directory	The output directory. --output-directory is required.
--output-file-prefix	The prefix used to label all output files. --output-file-prefix is required.
-r	The directory where the hash table resides.
--variant --variant-list	--variant specifies the path to a single gVCF file. To specify multiple gVCF files, use multiple --variant options. A maximum of 200 gVCFs is supported. --variant-list specifies a file that lists the gVCF files to combine, one file path per line.
--pedigree-file	Specify the path to a pedigree file that describes the relationship between samples. For more information, see Pedigree Mode.
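
A list file for --variant-list can be generated rather than written by hand. The following is a minimal Python sketch; the function name is hypothetical, and applying the 200-gVCF maximum to the list file as well as to repeated --variant options is an assumption, since the table states the limit without saying which form it covers.

```python
from typing import Iterable

MAX_GVCFS = 200  # maximum number of gVCFs stated in the table above


def build_variant_list(gvcf_paths: Iterable[str]) -> str:
    """Return list-file text with one gVCF path per line.

    Blank entries are dropped and surrounding whitespace is trimmed.
    Raises ValueError if no paths remain or if the count exceeds
    MAX_GVCFS (assumed here to apply to list files too).
    """
    paths = [p.strip() for p in gvcf_paths if p.strip()]
    if not paths:
        raise ValueError("no gVCF paths given")
    if len(paths) > MAX_GVCFS:
        raise ValueError(
            f"{len(paths)} gVCFs given, but at most {MAX_GVCFS} are supported"
        )
    return "\n".join(paths) + "\n"
```

Save the returned text to a file and pass that file's path to --variant-list.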