Input Files and Configuration - mbassalbioinformatics/SLICER GitHub Wiki

Input Files and Configuration

SLICER requires specific input files and parameters to run correctly. These can be provided via a configuration file or directly as command-line arguments.

1. Sequencing Data File

  • Format: SLICER accepts PacBio/ONT long-read sequencing data in either:
    • Unaligned BAM format (.ubam)
    • Compressed FASTQ format (.fq.gz)
  • Content: This file should contain the raw or HiFi (PacBio) reads from your long-read sequencing run.
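As a pre-flight check before launching a run, the accepted formats above can be tested by file extension. This is a minimal sketch and not part of SLICER: the function name `detect_input_format` and its return labels are hypothetical, and SLICER itself may identify formats differently.

```python
from pathlib import Path

# Extensions listed in this section. The function name and the returned
# labels are hypothetical and exist only for this illustration.
ACCEPTED_SUFFIXES = (".ubam", ".fq.gz")


def detect_input_format(path):
    """Return 'ubam' or 'fastq_gz' based on the file name, else raise ValueError."""
    name = Path(path).name.lower()
    if name.endswith(".ubam"):
        return "ubam"
    if name.endswith(".fq.gz"):
        return "fastq_gz"
    raise ValueError(
        f"Unsupported input file {path!r}; expected one of {ACCEPTED_SUFFIXES}"
    )
```

Calling `detect_input_format("/data/pacbio_reads.fq.gz")` returns the label for compressed FASTQ, while a name such as `reads.fastq` raises `ValueError`.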

2. Configuration File (Recommended)

Using a configuration file is the recommended way to run SLICER, especially for complex analyses or to ensure reproducibility. The configuration file is a tab-delimited text file (.tsv or .txt) with one parameter per line: the parameter name, a tab, then its value.

  • Argument: When running SLICER, pass the path to this file with either --arglist /path/to/your_config.tsv or --config_file /path/to/your_config.txt.

Configuration File Parameters:

| Parameter | Description | Example Value | Required? |
| --- | --- | --- | --- |
| input | Path to the input PacBio sequencing file (.ubam or .fq.gz). | /data/pacbio_reads.fq.gz | Yes |
| outdir | Path to the directory where SLICER will save all output files. If it doesn't exist, SLICER will attempt to create it. | /results/slicer_run1/ | Yes |
| setname | A prefix string that will be used for naming output files. Helps in organizing multiple runs. | experiment_A | Yes |
| mapref | (Optional) Path to a FASTA file containing reference sequences for mapping. If provided, SLICER will perform reference-based analysis. | /refs/construct_references.fasta | No |
| autoref | (Optional) Specifies the de novo reference prediction method if mapref is NOT provided. Options: slope or distance. Default: slope. | distance | No |
| thread | Number of CPU cores SLICER should use for parallel processing tasks. | 16 | No (default: 1) |
| config | [IMPORTANT] Defines the structural configuration of the library being analyzed. Currently, SLICER primarily supports one main configuration (Design 1 from manuscript Figure 1). Set to 1. Future versions might support more designs. | 1 | Yes |
| lfs_sequence | The full sequence of the Left Flanking Sequence (LFS) that is expected immediately upstream of your barcode. | GTAATGTGGAAAGGACGAAACACCGCACCG | Yes |
| rfs_sequence | The full sequence of the Right Flanking Sequence (RFS) that is expected between your barcode and core sequence. | GTTTCTTGAAAAAGTGGCACCGAGTCGGTA | Yes |
| rbs_sequence | The full sequence (or at least the starting portion) of the Right Backbone Sequence (RBS) expected immediately downstream of your core sequence. | AGGAGCCACCATGGCCCCAAAGAAGAAGCG | Yes |
| slen | The length (in base pairs) of the short anchor motifs SLICER will search for at the ends/starts of the lfs_sequence, rfs_sequence, and rbs_sequence. This value defines the exact match length for LFS_end, RFS_start, RFS_end, and RBS_start. | 20 | Yes |
| min_barcode_len_factor | (Optional) Factor to determine minimum barcode length relative to mode (mode * factor). Default: 0.75. | 0.8 | No |
| max_barcode_len_factor | (Optional) Factor to determine maximum barcode length relative to mode (mode * factor). Default: 1.5. | 1.2 | No |
| mapq_threshold | (Optional) Minimum MAPQ score to retain reads after alignment. Reads below this are discarded. Default: 0 (no filtering). | 20 | No |
| hamming_dist_barcode | (Optional) Max Hamming distance for barcode clustering in "distance" de novo method. Default: 1 or 2 (verify default). | 1 | No |
| levenshtein_dist_core | (Optional) Max Levenshtein distance for core sequence clustering in "distance" de novo method. Default: 5 or 10 (verify default). | 5 | No |
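To make the `mode * factor` rule for min_barcode_len_factor and max_barcode_len_factor concrete, here is a minimal sketch of how such bounds could be computed. It assumes that "mode" means the most frequent observed barcode length, that the bounds are inclusive, and that no rounding is applied; SLICER's actual behavior on these points is not stated here. The function name `barcode_length_bounds` is hypothetical.

```python
from collections import Counter


def barcode_length_bounds(barcode_lengths, min_factor=0.75, max_factor=1.5):
    """Return (mode, min_len, max_len) for a list of observed barcode lengths.

    Assumption: 'mode' is the most frequent length. The default factors are
    the documented defaults (0.75 and 1.5).
    """
    if not barcode_lengths:
        raise ValueError("no barcode lengths given")
    mode = Counter(barcode_lengths).most_common(1)[0][0]
    return mode, mode * min_factor, mode * max_factor


# Made-up lengths for illustration only.
lengths = [20, 20, 20, 19, 21, 35, 8]
mode, min_len, max_len = barcode_length_bounds(lengths)
# Assumption: bounds are inclusive on both sides.
kept = [n for n in lengths if min_len <= n <= max_len]
```

With a mode of 20 and the default factors, lengths from 15 to 30 are retained, so the 35 bp and 8 bp barcodes in the made-up list are dropped.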

Understanding Anchor Sequences and slen:

SLICER identifies key regions in your reads based on four anchor points derived from the sequences you provide:

  1. LFS_end: The last slen bases of your lfs_sequence.
    • Example: If lfs_sequence is AGCTAGCTAGCTAGCTAGCT and slen is 5, LFS_end will be TAGCT.
  2. RFS_start: The first slen bases of your rfs_sequence.
    • Example: If rfs_sequence is GATCGATCGATCGATCGATC and slen is 5, RFS_start will be GATCG.
  3. RFS_end: The last slen bases of your rfs_sequence.
    • Example: If rfs_sequence is GATCGATCGATCGATCGATC and slen is 5, RFS_end will be CGATC.
  4. RBS_start: The first slen bases of your rbs_sequence.
    • Example: If rbs_sequence is TCGATCGATCGATCGATCGA and slen is 5, RBS_start will be TCGAT.

The region between LFS_end and RFS_start is considered the Barcode. The region between RFS_end and RBS_start is considered the Core Sequence.
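The anchor definitions above map directly onto string slicing. The sketch below derives the four anchors and then pulls the barcode and core sequence out of a single read. It is illustrative only: the function names are hypothetical, and the search is a simplifying assumption (exact matching, forward strand only, first occurrence of each anchor), so SLICER's own matching may differ.

```python
def derive_anchors(lfs_sequence, rfs_sequence, rbs_sequence, slen):
    """Return the four anchor motifs as defined in this section."""
    if slen < 1:
        raise ValueError("slen must be a positive integer")
    return {
        "LFS_end": lfs_sequence[-slen:],    # last slen bases of the LFS
        "RFS_start": rfs_sequence[:slen],   # first slen bases of the RFS
        "RFS_end": rfs_sequence[-slen:],    # last slen bases of the RFS
        "RBS_start": rbs_sequence[:slen],   # first slen bases of the RBS
    }


def extract_regions(read, anchors):
    """Return (barcode, core) or None if any anchor is missing.

    Assumption: exact, forward-strand matching, searching left to right.
    """
    lfs_pos = read.find(anchors["LFS_end"])
    if lfs_pos == -1:
        return None
    barcode_start = lfs_pos + len(anchors["LFS_end"])
    barcode_end = read.find(anchors["RFS_start"], barcode_start)
    if barcode_end == -1:
        return None
    rfs_end_pos = read.find(anchors["RFS_end"], barcode_end)
    if rfs_end_pos == -1:
        return None
    core_start = rfs_end_pos + len(anchors["RFS_end"])
    core_end = read.find(anchors["RBS_start"], core_start)
    if core_end == -1:
        return None
    return read[barcode_start:barcode_end], read[core_start:core_end]


# Flanking sequences are the example values from the parameter table;
# the barcode and core are made up for illustration.
lfs = "GTAATGTGGAAAGGACGAAACACCGCACCG"
rfs = "GTTTCTTGAAAAAGTGGCACCGAGTCGGTA"
rbs = "AGGAGCCACCATGGCCCCAAAGAAGAAGCG"
anchors = derive_anchors(lfs, rfs, rbs, slen=10)
read = lfs + "ACGTACGTTTGA" + rfs + "CCCTTTAAAGGGCCC" + rbs
regions = extract_regions(read, anchors)
```

Here `regions` holds the made-up barcode and core that were placed between the flanking sequences. A read that lacks any one of the four anchors yields `None`.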

[Figure: SLICER Library Designs]

Ensure that slen is not longer than any of the provided lfs_sequence, rfs_sequence, or rbs_sequence. The slen should be long enough to be unique but short enough to be consistently found despite sequencing errors near the ends of elements.
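The two constraints on slen can be checked before a run. The sketch below is a hypothetical helper, not a SLICER feature: it verifies that slen fits inside each provided sequence and, as a rough proxy for uniqueness, counts how often each anchor occurs within the three provided sequences. It cannot tell whether an anchor also occurs inside barcodes or core sequences, so a clean result is not a guarantee.

```python
def count_occurrences(haystack, needle):
    """Count occurrences of needle in haystack, including overlapping ones."""
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        start = haystack.find(needle, start + 1)
    return count


def check_slen(lfs_sequence, rfs_sequence, rbs_sequence, slen):
    """Return a list of problem descriptions; an empty list means no issue found."""
    if slen < 1:
        return ["slen must be a positive integer"]
    sequences = {
        "lfs_sequence": lfs_sequence,
        "rfs_sequence": rfs_sequence,
        "rbs_sequence": rbs_sequence,
    }
    problems = [
        f"slen ({slen}) is longer than {name} ({len(seq)} bp)"
        for name, seq in sequences.items()
        if slen > len(seq)
    ]
    if problems:
        return problems
    anchors = {
        "LFS_end": lfs_sequence[-slen:],
        "RFS_start": rfs_sequence[:slen],
        "RFS_end": rfs_sequence[-slen:],
        "RBS_start": rbs_sequence[:slen],
    }
    # Join with 'N' so that no match can span two different sequences.
    joined = "N".join(sequences.values())
    for label, anchor in anchors.items():
        hits = count_occurrences(joined, anchor)
        if hits > 1:
            problems.append(
                f"{label} ({anchor}) occurs {hits} times in the provided sequences"
            )
    return problems
```

For the example values in the parameter table with slen set to 20, this check reports no problems. For short anchors taken from highly repetitive sequences, it flags the repeated motifs.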

Example Configuration File (config.tsv):

input	/path/to/pacbio_data.ubam
outdir	/output/my_slicer_run/
setname	my_sample
lfs_sequence	GTAATGTGGAAAGGACGAAACACCGCACCGTAGCATGACGACTGACGATCG
rfs_sequence	ACTGCTAGCATCGATCGATCGATCGATCGTTTCTTGAAAAAGTGGCACCGAGTCGGTAGACTAGCATGCATGCA
rbs_sequence	AGGAGCCACCATGGCCCCAAAGAAGAAGCGACTAGCTAGCATCGATGCATG
slen	20
thread	16
config	1
# mapref /path/to/references.fa  # Uncomment if using a reference
# autoref distance              # Uncomment to use distance method if no mapref
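A configuration file in this layout can be sanity-checked before it is handed to SLICER. The following sketch parses the parameter/value pairs and confirms that every parameter marked as required in the table is present. It is not SLICER's own parser: the function name `parse_config` is hypothetical, and it assumes that lines starting with `#` are comments, as the example above suggests.

```python
import io

# Parameters marked "Yes" in the Required? column of the parameter table.
REQUIRED = ("input", "outdir", "setname", "config",
            "lfs_sequence", "rfs_sequence", "rbs_sequence", "slen")
INT_KEYS = ("slen", "thread", "config", "mapq_threshold",
            "hamming_dist_barcode", "levenshtein_dist_core")
FLOAT_KEYS = ("min_barcode_len_factor", "max_barcode_len_factor")


def parse_config(handle):
    """Parse 'parameter<TAB>value' lines from an open text handle into a dict."""
    params = {}
    for lineno, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'parameter<TAB>value'")
        params[key.strip()] = value.strip()
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    for key in INT_KEYS:
        if key in params:
            params[key] = int(params[key])
    for key in FLOAT_KEYS:
        if key in params:
            params[key] = float(params[key])
    return params


# In-memory stand-in for config.tsv, using the values from the example above.
example = io.StringIO(
    "input\t/path/to/pacbio_data.ubam\n"
    "outdir\t/output/my_slicer_run/\n"
    "setname\tmy_sample\n"
    "lfs_sequence\tGTAATGTGGAAAGGACGAAACACCGCACCGTAGCATGACGACTGACGATCG\n"
    "rfs_sequence\tACTGCTAGCATCGATCGATCGATCGATCGTTTCTTGAAAAAGTGGCACCGAGTCGGTAGACTAGCATGCATGCA\n"
    "rbs_sequence\tAGGAGCCACCATGGCCCCAAAGAAGAAGCGACTAGCTAGCATCGATGCATG\n"
    "slen\t20\n"
    "thread\t16\n"
    "config\t1\n"
    "# mapref /path/to/references.fa\n"
)
params = parse_config(example)
```

For a file on disk, the same function can be called as `parse_config(open("config.tsv"))`. Splitting only on the first tab keeps any later tabs as part of the value.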