Barcode Error Correction
Cell-barcode sequences from the input reads are error-corrected based on their frequency and, optionally, against a list of expected cell-barcode sequences. A cell-barcode sequence is corrected to another cell-barcode sequence if the two differ by exactly one base (Hamming distance of 1) and:
• Either the corrected cell-barcode is at least two times more frequent across all input reads
Or
• The corrected cell-barcode is on the list of expected cell-barcode sequences, but the original cell-barcode is not
When a cell-barcode is corrected, all reads carrying the original cell-barcode are reassigned to the corrected cell-barcode. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2020)¹.
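The correction rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function names, the `min_ratio` parameter, the reading of "two times more frequent" as a count ratio of at least 2, and the tie-breaking choice (prefer the most frequent candidate) are all hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def hamming1_neighbors(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in BASES:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def build_correction_map(barcodes, expected=None, min_ratio=2):
    """Map each correctable cell-barcode to the barcode it is corrected to.

    barcodes  : one cell-barcode string per input read
    expected  : optional collection of expected cell-barcode sequences
    min_ratio : assumed frequency ratio required for a correction
    """
    counts = Counter(barcodes)
    expected = set(expected or ())
    corrections = {}
    for bc, n in counts.items():
        best = None
        for cand in hamming1_neighbors(bc):
            m = counts.get(cand, 0)
            # Rule 1: the candidate is observed and sufficiently more frequent.
            by_frequency = m > 0 and m >= min_ratio * n
            # Rule 2: the candidate is expected, the original is not.
            by_list = cand in expected and bc not in expected
            # Assumption: among several valid candidates, keep the most frequent.
            if (by_frequency or by_list) and (
                best is None or m > counts.get(best, 0)
            ):
                best = cand
        if best is not None:
            corrections[bc] = best
    return corrections


def apply_corrections(barcodes, corrections):
    """Reassign every read to its corrected cell-barcode, if it has one."""
    return [corrections.get(bc, bc) for bc in barcodes]
```

For example, with ten reads of `AAAA` and three reads of `AAAT`, `build_correction_map` maps `AAAT` to `AAAA`, and `apply_corrections` then assigns all thirteen reads to `AAAA`. The sketch does not resolve chains of corrections; how those are handled is not specified here.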
To avoid overcounting UMIs because of sequencing errors, UMI error correction is performed among all reads that share the same cell-barcode and map to the same gene. UMI sequences that are likely errors of another UMI are not counted.
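A minimal sketch of this per-cell, per-gene UMI counting follows. The criterion for "likely errors" is not spelled out above, so the sketch assumes the same rule as for cell-barcodes: a UMI is treated as an error if another UMI in the same group is at Hamming distance 1 and at least `min_ratio` times as frequent. That threshold, the function names, and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_umis(reads, min_ratio=2):
    """Count UMIs per (cell-barcode, gene), skipping likely error UMIs.

    reads : iterable of (cell_barcode, gene, umi) tuples, one per read
    """
    # Group reads by (cell-barcode, gene) and tally each UMI in the group.
    groups = defaultdict(Counter)
    for cell, gene, umi in reads:
        groups[(cell, gene)][umi] += 1

    result = {}
    for key, umi_counts in groups.items():
        kept = 0
        for umi, n in umi_counts.items():
            # Assumed rule: a UMI is an error if a one-base neighbor in the
            # same group is at least min_ratio times as frequent.
            is_error = any(
                len(other) == len(umi)
                and hamming(umi, other) == 1
                and m >= min_ratio * n
                for other, m in umi_counts.items()
            )
            if not is_error:
                kept += 1
        result[key] = kept
    return result
```

In a group with eight reads of `AACG`, two of `AACT`, and three of `GGTT`, the sketch counts two UMIs, because `AACT` is one base away from the much more frequent `AACG` and is not counted.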
¹Smith, T., Heger, A. and Sudbery, I., 2020. UMI-Tools: Modeling Sequencing Errors In Unique Molecular Identifiers To Improve Quantification Accuracy. [PDF] Cold Spring Harbor Laboratory Press. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976.
