The Distance Method

The "distance" method offers an alternative approach to de novo reference generation. It is particularly suited to scenarios where the core assumption of the "slope" method (that true sequences are significantly more abundant than their variants) may not hold. This can occur, for instance, if altered or erroneous barcodes are unusually prevalent in the sequencing library.

The method starts by calculating pairwise Hamming distances between all unique barcode sequences identified in the dataset. Barcodes whose Hamming distance falls below a user-defined threshold are clustered together, on the premise that they represent variants of a single "true" barcode. A conservative threshold is generally recommended to avoid merging genuinely distinct barcodes, and the optimal value may depend on the error profile of the long-read sequencing (LRS) technology employed (e.g., PacBio HiFi versus ONT).
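The barcode grouping step can be sketched as follows. This is a minimal illustration, not SLICER's actual code: the function names are hypothetical, and the single-linkage (union-find) strategy is an assumption, since the text above does not say how barcodes are linked once their distances are known.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def group_barcodes(barcodes, threshold):
    """Group unique barcodes whose Hamming distance is below `threshold`.

    Assumption: single-linkage grouping, so two barcodes end up together
    if a chain of below-threshold pairs connects them.
    """
    unique = sorted(set(barcodes))
    parent = {b: b for b in unique}

    def find(b):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[b] != b:
            parent[b] = parent[parent[b]]
            b = parent[b]
        return b

    for a, b in combinations(unique, 2):
        if hamming(a, b) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for b in unique:
        groups.setdefault(find(b), []).append(b)
    return sorted(groups.values())


example_groups = group_barcodes(["AAAA", "AAAT", "GGGG", "GGGC", "CCCC"], threshold=2)
```

With a threshold of 2, only barcodes differing at a single position are linked. Note that single linkage can chain distant barcodes together through intermediates, which is one reason a conservative threshold matters.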

Once these primary barcode groups are established, SLICER identifies the most representative core sequence for each group. Within each barcode group, pairwise Levenshtein distances are computed between all unique associated core sequences. The core sequence with the lowest average pairwise distance to the others is selected as the canonical core sequence for that barcode group. This step yields a set of preliminary reference sequences, each constructed as: LFS + Barcode (from the group) + RFS + Selected Core Sequence.
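A minimal sketch of the core selection step is shown below. The helper names (`levenshtein`, `select_core`, `build_reference`) are hypothetical, the tie-breaking rule (first candidate in sorted order) is an assumption, and LFS and RFS are treated as plain strings supplied by the caller.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def select_core(cores):
    """Return the unique core with the lowest mean distance to the other cores."""
    unique = sorted(set(cores))
    if len(unique) == 1:
        return unique[0]

    def mean_distance(candidate):
        others = [c for c in unique if c != candidate]
        return sum(levenshtein(candidate, c) for c in others) / len(others)

    # Assumption: ties go to the first candidate in sorted order.
    return min(unique, key=mean_distance)


def build_reference(lfs, barcode, rfs, core):
    """Assemble a preliminary reference in the order given in the text."""
    return lfs + barcode + rfs + core


chosen_core = select_core(["ACGTACGT", "ACGTACGA", "ACGTTCGT", "ACGTACGT"])
```

In this example `ACGTACGT` is one edit away from each of the other two cores, while those two are two edits apart, so it has the lowest average distance and is selected.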

To further refine these preliminary references and account for potential over-segmentation during the initial barcode grouping, a second clustering step is performed. The selected core sequences of the preliminary references are compared using Levenshtein distance. References whose core sequences are highly similar (i.e., below a user-specified Levenshtein distance threshold) are merged into a consolidated reference group, even if their barcodes were initially considered distinct based on Hamming distance. Within each merged group, the barcode belonging to the preliminary reference with the highest original read count is chosen to represent the final, consolidated reference sequence. These finalized de novo reference sequences are then compiled into a FASTA file, which serves as the input for the subsequent alignment stage.
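The merging step might look like the sketch below. Several details are assumptions rather than documented SLICER behaviour: the `PrelimRef` record and function names are hypothetical, merging is single-linkage, and the merged group keeps the core sequence of the same highest-count reference that supplies the barcode (the text above specifies only the barcode).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PrelimRef:
    barcode: str
    core: str
    reads: int  # original read count supporting this preliminary reference


def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-wise dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def merge_references(refs, threshold):
    """Merge references with near-identical cores; keep the best-supported one."""
    parent = list(range(len(refs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(refs)), 2):
        if levenshtein(refs[i].core, refs[j].core) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, ref in enumerate(refs):
        groups.setdefault(find(i), []).append(ref)
    # One representative per merged group: the reference with the most reads.
    return [max(members, key=lambda r: r.reads) for members in groups.values()]


prelim = [
    PrelimRef("AAAA", "ACGTACGT", 10),
    PrelimRef("CCCC", "ACGTACGA", 50),
    PrelimRef("GGGG", "TTTTTTTT", 5),
]
merged = merge_references(prelim, threshold=2)
```

Here the first two references have cores one edit apart, so they merge despite their barcodes being four mismatches apart, and the barcode `CCCC` is retained because its reference has the higher read count.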

[Figure: fig4_distance]