Genotype Sample Demultiplexing

DRAGEN implements several strategies for demultiplexing data sets that contain mixtures of cells from different individuals, such as cells pooled in one library prep or microfluidic run. These strategies include genotype-based and genotype-free demultiplexing. In the genotype demultiplexing methods, DRAGEN assigns a sample identity to each cell based on the alleles observed in the reads of that cell. DRAGEN only accounts for SNVs. DRAGEN can also flag doublets, such as droplets that contain multiple cells from different individuals.

To use genotype-based sample demultiplexing, you must provide a VCF file with genotypes for each sample in the data set. To use genotype-free sample demultiplexing, you must provide a VCF file with a set of external samples, preferably from a population with the same genetic background. In both cases, the GT field of the VCF file represents the sample genotypes.
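Because only SNVs and the GT field are used, it can help to inspect a VCF file before passing it to DRAGEN. The following is a minimal sketch, not part of DRAGEN, that lists the per-sample GT values of SNV records in standard tab-delimited VCF text. The function name snv_genotypes and the filter (REF and every ALT allele must be a single A, C, G, or T) are assumptions made for illustration. The source does not describe how DRAGEN itself selects SNVs.

```python
SNV_BASES = {"A", "C", "G", "T"}


def snv_genotypes(vcf_lines):
    """Yield (chrom, pos, {sample: GT}) for SNV records in VCF text lines.

    Hypothetical helper for illustration; it does not reproduce DRAGEN's
    internal variant filtering.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if len(fields) < 10:
            continue  # record has no genotype columns
        chrom, pos = fields[0], int(fields[1])
        ref, alt, fmt = fields[3], fields[4], fields[8]
        # Assumed SNV definition: every allele is exactly one nucleotide.
        alleles = [ref] + alt.split(",")
        if not all(a.upper() in SNV_BASES for a in alleles):
            continue
        keys = fmt.split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        genotypes = {
            name: column.split(":")[gt_index]
            for name, column in zip(samples, fields[9:])
        }
        yield chrom, pos, genotypes
```

For a compressed VCF file, the lines can be supplied from gzip.open(path, "rt").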

For information on the cell-hashing demultiplexing method, see Cell-Hashing.

Command-Line Options

You can use the following command-line options for scRNA sample demultiplexing.

Option	Description
--single-cell-demux-sample-vcf	If using genotype-based sample demultiplexing, specify the VCF file that contains the sample genotypes.
--single-cell-demux-reference-vcf	If using genotype-free sample demultiplexing, specify the VCF file that contains the genotypes of a population with a similar genetic background to the samples you are using.
--single-cell-demux-detect-doublets	Enable doublet detection in genotype-based sample demultiplexing. The default value is false.
--single-cell-demux-number-sample	The number of samples you are using. This option is only applicable when using an external VCF reference specified with the --single-cell-demux-reference-vcf option.

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-based demultiplexing.

dragen --enable-rna=true --enable-single-cell-rna=true --umi-source=fastq --single-cell-barcode 0_15 --single-cell-umi 16_25 -r reference_genomes/Mus_musculus/mm10/DRAGEN/8 -a reference_genomes/Mus_musculus/mm10/gtf/gencode.vM23.annotation.gtf.gz -1 lib1_S7_L001_R2_001.fastq.gz --umi-fastq lib1_S7_L001_R1_001.fastq.gz --RGID=1 --RGSM=sample1 --output-dir=/staging/out --output-file-prefix=sample1 --single-cell-demux-detect-doublets=true --single-cell-demux-sample-vcf=sample.vcf

The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-free demultiplexing.

Outputs

You can find information related to the output of genotype-based scRNA sample demultiplexing in the following three files.

The <prefix>.scRNA.barcodeSummary.tsv file contains per-cell metrics, including cell barcodes. The following columns contain per-cell demultiplexing information. See Single Cell RNA Outputs for more information on <prefix>.scRNA.barcodeSummary.tsv metrics.

Column	Description
SampleIdentity	The SampleIdentity column can contain the following values:
• sampleX: The particular cell (barcode) is uniquely assigned to a sample.
• AMB(sampleX,sampleY): The algorithm cannot determine the sample to assign the barcode to.
• MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY): The cell barcode is classified as a doublet. For example, MIX(50*sampleX+50*sampleY).

IdentityQscore	The IdentityQscore column contains the value used to estimate the confidence of the sample identity call. After DRAGEN determines the doublet status of the cell as singlet, ambiguous, or doublet, the identity Q-score is defined as -10 * log10(Probability that the assigned identity is incorrect, given the second most likely identity and the doublet status). Higher identity Q-scores correspond to more confident sample identity calls.
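The two columns above can be interpreted downstream with a small parser. The following is a minimal sketch, not part of DRAGEN, that classifies a SampleIdentity value as singlet, ambiguous, or doublet and converts an identity Q-score to a probability. The function names are hypothetical. The parser assumes sample names contain no commas, plus signs, or parentheses. The conversion assumes the Q-score follows the Phred convention, in which the probability inside the logarithm is that of the call being wrong, which is the reading under which higher scores mean more confident calls.

```python
import re

# Patterns for the documented value forms; plain sample names match neither.
_MIX = re.compile(r"^MIX\((\d+(?:\.\d+)?)\*(.+?)\+(\d+(?:\.\d+)?)\*(.+)\)$")
_AMB = re.compile(r"^AMB\((.+)\)$")


def parse_sample_identity(value):
    """Classify one SampleIdentity value (hypothetical helper)."""
    value = value.strip()
    mix = _MIX.match(value)
    if mix:
        return {
            "status": "doublet",
            "samples": [mix.group(2), mix.group(4)],
            "mixing": [float(mix.group(1)), float(mix.group(3))],
        }
    amb = _AMB.match(value)
    if amb:
        return {
            "status": "ambiguous",
            "samples": amb.group(1).split(","),
            "mixing": None,
        }
    return {"status": "singlet", "samples": [value], "mixing": None}


def qscore_to_error_probability(qscore):
    """Invert Q = -10 * log10(p), assuming p is the probability of a wrong call."""
    return 10 ** (-qscore / 10)
```

Under that assumption, a Q-score of 20 corresponds to a 1% chance that the assigned identity is wrong.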

The <prefix>.scRNA.demux.tsv file contains sample demultiplexing statistics that were used to infer sample identity of each cell.

Column	Description
Barcode	The cell barcode associated with the cell.
DemuxSNPCount	The number of SNPs that the reads of the cell barcode intersect.
DemuxReadCount	The number of UMIs of the cell barcode that intersect at least one SNP.
Pure Samples	Samples from the VCF file.
BestMixtureIdentity	Mixture sample with the highest log likelihood. Only available if --single-cell-demux-detect-doublets=true.
BestMixtureLogLikelihood	The log likelihood of the best mixture sample. Only available if --single-cell-demux-detect-doublets=true.
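The DemuxSNPCount column indicates how much genotype evidence supports each call, so it can be useful to list barcodes with few intersected SNPs. The following is a minimal sketch, not part of DRAGEN, that reads the tab-delimited file with Python's csv module. The function name and the threshold of 10 SNPs are arbitrary assumptions for illustration, not a DRAGEN recommendation. Only the Barcode and DemuxSNPCount column names come from the table above.

```python
import csv


def barcodes_with_few_snps(handle, min_snps=10):
    """Return barcodes whose DemuxSNPCount is below min_snps.

    `handle` is an open text file (or any iterable of lines) holding the
    contents of <prefix>.scRNA.demux.tsv. The threshold is hypothetical.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Barcode"]
        for row in reader
        if int(row["DemuxSNPCount"]) < min_snps
    ]
```

A typical call opens the file with open(path, newline="") and passes the handle to the function.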

The <prefix>.scRNA.metrics.demuxSamples.csv file contains per-sample metrics, similar to the metrics reported for the overall data set in <prefix>.scRNA.metrics.csv.

Column	Description
Passing cells	The number of cell barcodes that passed.
Fraction genic reads in cells	Counted reads assigned to the cells that passed.
Median reads per cell	Total counted reads per cell that passed the filters.
Median UMIs per cell	Total counted UMIs per cell that passed the filters.
Median genes per cell	Genes with at least one UMI per cell that passed the filters.