De Novo Reference Prediction - mbassalbioinformatics/SLICER GitHub Wiki

De Novo Reference Prediction

A key feature of SLICER is its ability to predict reference sequences directly from your PacBio long-read data when a pre-defined reference file (mapref) is not available. This is particularly useful for:

- Discovery experiments where the exact sequences of all constructs are unknown.
- Analyzing libraries with unexpected variants or novel assemblies.
- Situations where generating a comprehensive reference list beforehand is impractical.

SLICER implements two distinct algorithms for de novo reference prediction, selectable via the autoref parameter in the configuration file (or command line):

The Slope Method (autoref slope):
- This is the default method if mapref is not provided and autoref is not specified.
- It assumes that true, unaltered barcode sequences will be significantly more abundant than their variants arising from sequencing errors or minor biological mutations.
- It identifies reference barcodes by looking for a steep drop ("slope") in read counts when barcodes are sorted by frequency.
The Distance Method (autoref distance):
- This method is an alternative for cases where the assumptions of the slope method might not hold (e.g., very high sequencing error rates, or when altered barcodes are unusually abundant).
- It groups barcodes based on sequence similarity (Hamming distance) and then refines these groupings by comparing their associated core sequences (Levenshtein distance).
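
The idea behind the slope method can be illustrated with a minimal Python sketch. This is not SLICER's code: the function name `predict_by_slope`, the use of a count ratio between neighbouring ranks, and the example counts are all assumptions made for illustration. The sketch sorts barcodes by read count, finds the largest relative drop between consecutive ranks, and keeps everything above that drop.

```python
from collections import Counter


def predict_by_slope(barcode_counts):
    """Return barcodes ranked above the steepest relative drop in read count.

    Hypothetical helper: the cut-off rule (largest ratio between
    neighbouring counts) is an assumption, not SLICER's actual criterion.
    """
    # Rank barcodes from most to least abundant.
    ranked = sorted(barcode_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [bc for bc, _ in ranked]

    # Ratio of each count to the next lower one; a large ratio is a steep drop.
    drops = [
        ranked[i][1] / max(ranked[i + 1][1], 1)
        for i in range(len(ranked) - 1)
    ]
    # Index of the steepest drop; everything up to and including it is kept.
    cut = max(range(len(drops)), key=drops.__getitem__)
    return [bc for bc, _ in ranked[: cut + 1]]


# Made-up counts: two abundant barcodes followed by low-count variants.
counts = Counter({"ACGT": 950, "TTGA": 870, "ACGA": 12, "TTGC": 9, "ACCT": 3})
references = predict_by_slope(counts)
print(references)
```

With these counts the steepest drop lies between 870 and 12 reads, so only the two abundant barcodes are kept. When variants are nearly as abundant as the true barcodes, no such clear drop exists, which is the situation the distance method is meant for.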

Choosing a Method

- Slope Method: Generally recommended as the starting point. It is efficient and works well when its core assumption (a clear abundance gap between true barcodes and their variants) holds, which is often the case with PacBio HiFi data.
- Distance Method: Consider this method if the slope method yields unexpected results, if you suspect that high error rates have produced many barcode variants with substantial read counts, or if your library has very low diversity and the abundance gap is therefore less distinct. It is more computationally intensive than the slope method.
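
The two-stage comparison used by the distance method (Hamming distance on barcodes, then Levenshtein distance on core sequences) can be sketched as follows. This is a hedged illustration, not SLICER's implementation: the greedy, abundance-ordered grouping, the function `group_by_distance`, and the thresholds `max_barcode_dist` and `max_core_dist` are all hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def group_by_distance(records, max_barcode_dist=1, max_core_dist=2):
    """Greedily group (barcode, core, count) records around abundant barcodes.

    Hypothetical sketch: thresholds and grouping order are assumptions.
    Returns {reference_barcode: [member_barcodes]}.
    """
    groups = {}  # reference barcode -> member barcodes
    cores = {}   # reference barcode -> core sequence of that reference
    for barcode, core, _count in sorted(records, key=lambda r: r[2], reverse=True):
        for ref in groups:
            if (len(ref) == len(barcode)
                    and hamming(ref, barcode) <= max_barcode_dist
                    and levenshtein(cores[ref], core) <= max_core_dist):
                groups[ref].append(barcode)  # similar barcode AND similar core
                break
        else:
            # No existing reference matched: this barcode starts a new group.
            groups[barcode] = [barcode]
            cores[barcode] = core
    return groups


# Made-up records: (barcode, core sequence, read count).
records = [
    ("ACGT", "GATTACAGATTACA", 500),
    ("ACGA", "GATTACAGATTACA", 210),  # close barcode, same core
    ("ACGC", "CCCGGGTTTAAACC", 180),  # close barcode, different core
    ("TTGA", "GATTACAGATTACA", 90),   # distant barcode
]
groups = group_by_distance(records)
print(groups)
```

In this example `ACGA` is merged into `ACGT` despite its high read count, because both its barcode and its core are close to the reference. `ACGC` stays separate because its core differs. Comparing every barcode against every group is what makes this kind of approach more expensive than a single sort by abundance.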

Both methods produce a FASTA file of predicted reference sequences, which SLICER then uses internally for the alignment and quantification steps. The predicted reference file is also saved in your output directory so that you can inspect it.
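
To inspect a predicted reference file, a plain FASTA reader is enough. The sketch below uses only the Python standard library. The record names (`ref_1`, `ref_2`) and sequences are invented for the example, and the actual file name and header format that SLICER writes are not assumed here.

```python
import io


def write_fasta(references, handle, width=60):
    """Write (name, sequence) pairs as FASTA, wrapping sequence lines at `width`."""
    for name, seq in references:
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical records; an in-memory buffer stands in for the real file.
predicted = [("ref_1", "ACGT" * 20), ("ref_2", "GATTACA")]
buffer = io.StringIO()
write_fasta(predicted, buffer)
buffer.seek(0)
round_trip = list(read_fasta(buffer))

for name, seq in round_trip:
    print(name, len(seq))
```

For a file on disk, pass the handle returned by `open(path)` to `read_fasta`, with `path` pointing at the predicted reference file in your output directory.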