Troubleshooting and FAQ - mbassalbioinformatics/SLICER GitHub Wiki

Troubleshooting and FAQ

This page lists common issues encountered while running SLICER and provides potential solutions or explanations.

Common Issues

Low Number of Valid Reads / High Discard Rate:
- Problem: A very small percentage of reads pass the initial filtering steps.
- Possible Causes & Solutions:
  - Incorrect Anchor Sequences: Double-check the lfs_sequence, rfs_sequence, and rbs_sequence provided in your configuration. Ensure they exactly match what's expected in your constructs.
  - Incorrect slen: The slen parameter (length of anchor motifs) might be too long (making exact matches rare due to errors) or too short (leading to non-specific matches). Experiment with slightly different slen values.
  - Poor Library Quality: The DNA constructs themselves may have issues (e.g., failed Golden Gate assembly, deletions affecting anchor sites).
  - High Sequencing Error Rate: If using older PacBio CLR data or if sequencing quality was low, exact matches for anchors might be difficult. SLICER currently requires exact matches for these initial anchors.
  - Off-Target Amplification/Sequencing: A large fraction of reads might not be from your intended constructs.
  - Check the Venn Diagram: The output _missing_anchors_venn.png will show which specific anchors are most frequently missing, which can help pinpoint the problem.
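
To see which anchor is the bottleneck before rerunning the whole pipeline, you can count exact matches on a small sample of reads. The sketch below is not part of SLICER: the function names and toy reads are made up for illustration, and it assumes you supply each slen-long motif yourself. It checks both strands, which may differ from what SLICER does internally.

```python
# Illustrative sketch only; none of these helpers come from SLICER.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def missing_anchor_counts(reads, motifs):
    """For each motif name, count reads with no exact match on either strand."""
    missing = {name: 0 for name in motifs}
    for read in reads:
        forward = read.upper()
        reverse = reverse_complement(forward)
        for name, motif in motifs.items():
            motif = motif.upper()
            if motif not in forward and motif not in reverse:
                missing[name] += 1
    return missing


# Toy data: the second read carries a single mismatch inside the RFS motif,
# the third read is the reverse complement of the first.
reads = [
    "TTACGTACGGATTACAGGCCTT",
    "TTACGTACGGATTACAGGCGTT",
    "AAGGCCTGTAATCCGTACGTAA",
]
motifs = {"LFS_end": "ACGTACG", "RFS_start": "AGGCCTT"}  # hypothetical motifs

counts = missing_anchor_counts(reads, motifs)
for name, n_missing in counts.items():
    print(f"{name}: missing in {n_missing} of {len(reads)} reads")
```

If one motif is missing far more often than the others, start by re-checking that sequence in your configuration.
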
SLICER Fails to Start / Python Errors:
- Problem: The script exits immediately with a Python error (e.g., ModuleNotFoundError, ImportError).
- Possible Causes & Solutions:
  - Dependencies Not Installed: Ensure all prerequisite software (Minimap2, SAMtools, etc.) and Python libraries (see requirements.txt or Installation) are correctly installed and accessible in your environment.
  - Virtual Environment Not Activated: If you installed dependencies in a virtual environment, make sure it's activated before running SLICER.
  - Incorrect Python Version: SLICER may require a specific Python 3 version. Check the documentation.
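
A quick way to confirm that the active environment is the one you expect is to print the interpreter version and test whether the required modules can be located. This is a minimal sketch under stated assumptions: the module names below are placeholders, not SLICER's actual dependency list, so replace them with the import names that correspond to the entries in requirements.txt.

```python
import importlib.util
import sys


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Placeholder names for illustration; take the real list from requirements.txt.
# Note that an import name can differ from the package's install name.
modules_to_check = ["json", "csv"]

print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Interpreter:", sys.executable)
print("Missing modules:", find_missing_modules(modules_to_check))
```

If the interpreter path does not point inside your virtual environment, the environment is probably not activated.
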
Minimap2 / SAMtools / BamTools / Qualimap Errors:
- Problem: SLICER reports an error originating from one of its external tool dependencies.
- Possible Causes & Solutions:
  - Tool Not in PATH: The respective tool might not be installed correctly or its executable might not be in your system's PATH.
  - Tool Version Incompatibility: An incompatible version of the tool might be installed. Check SLICER's documentation for recommended versions.
  - Corrupted Input to Tool: Intermediate files passed to these tools might be corrupted or empty due to upstream issues in SLICER. Check SLICER's log file.
  - Insufficient Resources: The tool might be running out of memory or disk space.
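
The "Tool Not in PATH" case can be ruled out before launching a run. The following sketch uses Python's standard `shutil.which` to look up each executable. The executable names are the usual ones for these tools and are an assumption here; SLICER may invoke them differently on your system.

```python
import shutil


def find_missing_tools(tools):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ["minimap2", "samtools", "bamtools", "qualimap"]

missing_tools = find_missing_tools(REQUIRED_TOOLS)
if missing_tools:
    print("Not found on PATH:", ", ".join(missing_tools))
else:
    print("All listed tools were found on PATH.")
```

Run this from the same shell and environment in which you run SLICER, since PATH can differ between sessions.
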
De Novo Reference Prediction Issues:
- Problem ("Slope" Method): Too many or too few references predicted; reference sequences look incorrect.
  - Possible Causes: The assumption of clear abundance differences between true barcodes and variants may not hold for your dataset. Barcode length filtering might be too strict or too lenient.
  - Solutions: Try the "Distance" method. Adjust barcode length filtering parameters.
- Problem ("Distance" Method): Over-clustering or under-clustering of references.
  - Possible Causes: The Hamming or Levenshtein distance thresholds might be inappropriate for your data's error profile or diversity.
  - Solutions: Experiment with different hamming_dist_barcode and levenshtein_dist_core values. This often requires some empirical testing.
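
When tuning hamming_dist_barcode and levenshtein_dist_core, it can help to compute the two distances by hand on a few sequence pairs from your data. The functions below are textbook implementations written for illustration. They are not SLICER's code, and how SLICER applies the thresholds during clustering is not shown here.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Toy sequences: one substitution in the barcode, one deletion in the core.
print(hamming("ACGTACGT", "ACGTTCGT"))
print(levenshtein("ACGTACGT", "ACGTCGT"))
```

Hamming distance only counts substitutions and needs equal lengths, whereas Levenshtein distance also counts insertions and deletions, so the two thresholds are not interchangeable.
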
Low Alignment Rate / Poor Qualimap Metrics:
- Problem: A small percentage of filtered reads align to the reference (either provided or de novo). Qualimap reports show issues.
- Possible Causes & Solutions:
  - Incorrect Provided Reference: If using mapref, ensure your reference sequences accurately represent the constructs.
  - Poor De Novo Prediction: If using autoref, the predicted references might not be accurate. Try the alternative method or adjust its parameters.
  - Highly Divergent Constructs: Your actual constructs might be too different from the reference sequences due to extensive mutations, rearrangements, or errors.
  - Contamination: The sample might contain DNA that doesn't match your expected constructs.

Frequently Asked Questions (FAQ)

Q1: What is the config parameter in the configuration file?
- A1: The config parameter (e.g., config 1) refers to the expected structural layout of your sequenced amplicons, based on pre-defined designs (see manuscript Figure 1 for Design 1). SLICER uses this to correctly interpret the lfs_sequence, rfs_sequence, and rbs_sequence to find the barcode and core elements. Currently, Design 1 is the primary supported configuration.
Q2: How should I choose the slen value?
- A2: slen is the length of the anchor motifs (LFS_end, RFS_start, etc.) that SLICER searches for. It should be:
  - Long enough to be reasonably unique within your construct context.
  - Short enough that sequencing errors near the ends of elements don't prevent an exact match too often.
  - Typical values range from 15-25 bp. You may need to experiment. slen cannot be longer than the full lfs_sequence, rfs_sequence, or rbs_sequence you provide.
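
The two constraints above (slen no longer than any flanking sequence, and motifs that are unique in the construct) can be checked on paper or with a few lines of code. The sketch below is illustrative only: the helper names and toy sequences are invented, and it assumes that LFS_end is the last slen bases of lfs_sequence, which you should confirm against SLICER's documentation.

```python
def sequences_shorter_than_slen(slen, flanks):
    """Return names of flank sequences that are shorter than slen."""
    return [name for name, seq in flanks.items() if slen > len(seq)]


def count_occurrences(motif, construct):
    """Count exact, possibly overlapping, occurrences of motif in construct."""
    count = 0
    start = construct.find(motif)
    while start != -1:
        count += 1
        start = construct.find(motif, start + 1)
    return count


# Toy flank sequences and construct, not real SLICER inputs.
flanks = {
    "lfs_sequence": "GGATCCACGTACG",
    "rfs_sequence": "AGGCCTTGAATTC",
}
construct = "GGATCCACGTACG" + "NNNNNNNN" + "AGGCCTTGAATTC"

for slen in (4, 7, 20):
    too_short = sequences_shorter_than_slen(slen, flanks)
    if too_short:
        print(f"slen={slen}: longer than {', '.join(too_short)}")
        continue
    lfs_end = flanks["lfs_sequence"][-slen:]  # assumed definition of LFS_end
    hits = count_occurrences(lfs_end, construct)
    print(f"slen={slen}: LFS_end={lfs_end} occurs {hits} time(s) in the construct")
```

A motif that occurs more than once in the expected construct is a sign that slen is too short for that anchor.
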
Q3: Can SLICER process Oxford Nanopore (ONT) data?
- A3: SLICER is primarily designed and optimized for PacBio long-read data, especially HiFi reads. That said, you can process ONT uBAM/FASTQ files as well. Just be aware that ONT reads have a higher error rate, so if you run SLICER on them, especially with de novo reference prediction, you will likely get more predicted sequences than you bargained for. In all honesty, we have not evaluated SLICER with ONT data because we could not get our hands on suitable datasets, but in theory it should work. So be prepared for a little elbow grease to get things running smoothly the first time round. (If you do have ONT data, you are welcome to contribute to the repo and make it better for everyone else.)
Q4: What if my constructs have multiple barcodes or multiple core elements?
- A4: The current version of SLICER is designed to identify and extract a single barcode region and a single core sequence region, defined by one set of LFS/RFS/RBS anchors. Analyzing constructs with multiple distinct barcoded elements or multiple core segments within a single read would require modifications to SLICER's core logic. If you feel up to the task, go for it. For our purposes, though, we only had a single barcode-core arrangement.
Q5: How does SLICER handle sequencing errors within the barcode or core sequence?
- A5:
  - Anchor Identification: SLICER requires exact matches for the short slen anchor motifs (LFS_end, RFS_start, etc.) for initial read processing.
  - Barcode/Core Variation:
    - In reference-based mode, Minimap2's alignment algorithm can tolerate mismatches, insertions, and deletions within the barcode and core when mapping to your provided reference.
    - In de novo mode, the "slope" method relies on abundance, while the "distance" method uses Hamming/Levenshtein distances to cluster similar (but not identical) barcodes and core sequences, thereby accounting for errors.
Q6: Where can I find example datasets?
- A6: Hmm... you will have to wait for the publication using SLICER (hush-hush for now, sorry).

If your issue is not listed here, please feel free to open an issue on our GitHub repository, providing as much detail as possible, including:

- The SLICER version you are using.
- The command or configuration file you used.
- A snippet of the error message or relevant part of the log file.
- A brief description of your data and expected outcome.