Duplicate Marking

Algorithm

The DRAGEN duplicate-marking algorithm is modeled on the MarkDuplicates feature from the Picard toolkit. All aligned reads are grouped into subsets such that the members of each subset are potential duplicates of one another.

For two pairs to be duplicates, they must meet the following requirements:

• Identical alignment coordinates at both ends. Positions are adjusted for soft- or hard-clips from the CIGAR.

• Identical orientations of the two ends, with the end at the left-most coordinate taken first.

If an unpaired read has an identical coordinate and orientation to either end of any other read, whether paired or not, it can be marked as a duplicate.

Unmapped reads or read pairs are never marked as duplicates.
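The matching criteria above can be sketched as a grouping key. This is a minimal illustration, not DRAGEN's implementation: the `End` type, `unclipped_coordinate`, and `pair_key` are hypothetical names, and the clip adjustment follows the common Picard-style convention (an assumption) of using the unclipped 5' coordinate of each end.

```python
import re
from typing import NamedTuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

class End(NamedTuple):
    ref: str       # reference sequence name
    pos: int       # 1-based left-most aligned position (SAM POS)
    cigar: str     # CIGAR string
    reverse: bool  # True if aligned to the reverse strand

def unclipped_coordinate(end: End) -> int:
    """5' coordinate of the end, as if soft/hard-clipped bases were aligned."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(end.cigar)]
    if not end.reverse:
        # Forward strand: subtract the leading clips from POS.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return end.pos - lead
    # Reverse strand: the 5' end is the right-most base, so add the
    # reference span of the alignment plus the trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return end.pos + ref_len - 1 + trail

def pair_key(a: End, b: End) -> tuple:
    """Key shared by pairs with identical coordinates and orientations,
    ordered so that the left-most end comes first."""
    return tuple(sorted((e.ref, unclipped_coordinate(e), e.reverse) for e in (a, b)))
```

Two pairs whose `pair_key` values are equal would fall into the same group of potential duplicates under these assumptions, even when one pair is soft-clipped and the other is not.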

DRAGEN identifies a group of duplicates, selects the best duplicate of the group, and marks the others with the BAM PCR or optical duplicate flag (0x400 or decimal 1024). For the comparison, duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of the scores of both ends, while unpaired reads get the score of the one mapped end. The score is intended to preserve the reads with the highest-quality base calls.

If two pairs (or reads) have exactly matching quality scores, DRAGEN breaks the tie by choosing the pair with the higher alignment score. If there are multiple pairs that also tie on alignment score, then DRAGEN chooses a pair arbitrarily.
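The selection and tie-breaking rule can be sketched as follows. The `Candidate` class and `mark_duplicates` function are hypothetical names used only for illustration. The sketch assumes each candidate already carries its quality score and alignment score, and it resolves a full tie by keeping whichever candidate comes first, which is one possible "arbitrary" choice.

```python
from dataclasses import dataclass

DUPLICATE_FLAG = 0x400  # BAM PCR or optical duplicate flag (decimal 1024)

@dataclass
class Candidate:
    name: str
    quality_score: float   # average Phred quality score (summed over both ends for a pair)
    alignment_score: int   # used only to break ties on quality_score
    flag: int = 0          # BAM FLAG field

def mark_duplicates(group):
    """Keep the best candidate of a duplicate group and flag all others."""
    # Compare on quality first, then on alignment score.
    best = max(group, key=lambda c: (c.quality_score, c.alignment_score))
    for c in group:
        if c is not best:
            c.flag |= DUPLICATE_FLAG
    return best
```

For example, given candidates scored (30.0, 90), (30.0, 100), and (28.5, 120), the second is kept because it ties on quality with the first but has the higher alignment score, and the other two receive the 0x400 flag.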

The score for an unpaired read R is the average Phred quality score per base, calculated as follows:
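One form consistent with the definitions that follow is given here. It assumes that bases below the threshold are excluded from the sum and that the sum is divided by the full read length; both the denominator and the use of "greater than or equal to" are assumptions.

$$
\mathrm{score}(R) = \frac{\sum_{i \,:\, R.\mathrm{QUAL}[i] \,\ge\, \text{dedup-min-qual}} R.\mathrm{QUAL}[i]}{\mathrm{length}(R.\mathrm{QUAL})}
$$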

Where:

• R is a BAM record.

• QUAL is its array of Phred quality scores.

• dedup-min-qual is a DRAGEN configuration option with default value of 15.

For a pair, the score is the sum of the scores for the two ends.
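A small worked sketch of the read and pair scores follows. The function names `read_score` and `pair_score` are hypothetical. The sketch assumes that bases with quality below dedup-min-qual contribute nothing to the sum and that the average is taken over all bases of the read. Neither detail is confirmed here.

```python
def read_score(qual, dedup_min_qual=15):
    """Average Phred quality per base, counting only bases at or above
    dedup_min_qual in the sum (assumption: the divisor is the full read length)."""
    if not qual:
        return 0.0
    return sum(q for q in qual if q >= dedup_min_qual) / len(qual)

def pair_score(qual1, qual2, dedup_min_qual=15):
    """Score of a pair: the sum of the scores of its two ends."""
    return read_score(qual1, dedup_min_qual) + read_score(qual2, dedup_min_qual)
```

With the default threshold of 15, a read with qualities [30, 30, 10, 40] scores (30 + 30 + 40) / 4 = 25.0 under these assumptions, because the base with quality 10 is excluded from the sum. Pairing it with a mate of qualities [20, 20] gives 25.0 + 20.0 = 45.0.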

The score is stored as a one-byte number, with values rounded down to the nearest one-quarter. Rounding can lead to different duplicate marks than the ones chosen by Picard. However, the impact on variant calling results is negligible because the reads are close in quality.
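The effect of rounding can be illustrated with a short sketch. The function name `quantize_score` is hypothetical, and the sketch models only the rounding down to quarter units, not the one-byte storage layout, which is not specified here.

```python
import math

def quantize_score(score: float) -> float:
    """Round a score down to the nearest one-quarter."""
    return math.floor(score * 4) / 4
```

Under this rounding, scores of 30.10 and 30.20 both become 30.0 and are then treated as a tie, whereas a tool comparing the unrounded values would prefer the 30.20 read. A score of 30.25 is unchanged.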

Limitations

The following limitations apply to the DRAGEN duplicate-marking implementation:

• When two duplicate reads or pairs have close Phred sequence quality scores, DRAGEN might choose a different winner from the one chosen by Picard. The differences have negligible impact on variant calling results.

• If using a single FASTQ file as input, DRAGEN accepts only a single library ID as a command-line argument (RGLB). For this reason, the FASTQ inputs to the system must already be separated by library ID. Library ID cannot be used as a criterion for distinguishing nonduplicates.

Settings

The following options can be used to configure duplicate marking in DRAGEN:

Option	Description
--enable-duplicate-marking	To enable duplicate marking, set to true. When duplicate marking is enabled, the output is sorted, regardless of the value of the --enable-sort option.
--remove-duplicates	To suppress the output of duplicate records, set to true. If set to false, duplicate records are kept in the output and marked with the 0x400 flag in the FLAG field of the BAM record. Enabling --remove-duplicates also enables --enable-duplicate-marking.
--dedup-min-qual	Specifies the minimum Phred quality score for a base to be included in the quality score calculation used for choosing among duplicate reads. The default value is 15.