Input Files and Configuration - mbassalbioinformatics/SLICER GitHub Wiki

Input Files and Configuration

SLICER requires specific input files and parameters to run correctly. These can be provided via a configuration file or directly as command-line arguments.

1. Sequencing Data File

Format: SLICER accepts PacBio/ONT long-read sequencing data in either:
- Unaligned BAM format (.ubam)
- Compressed FASTQ format (.fq.gz)
Content: This file should contain the reads from your long-read sequencing run (for PacBio, either raw or HiFi reads).

2. Configuration File (Recommended)

Using a configuration file is the recommended way to run SLICER, especially for complex analyses or to ensure reproducibility. The configuration file is a tab-delimited text file (saved as .tsv or .txt) with one parameter name and its value per line.

Argument: Specify the path to this file using the --arglist /path/to/your_config.tsv or --config_file /path/to/your_config.txt option when running SLICER.

Configuration File Parameters:

| Parameter | Description | Example Value | Required? |
| --- | --- | --- | --- |
| `input` | Path to the input PacBio sequencing file (.ubam or .fq.gz). | `/data/pacbio_reads.fq.gz` | Yes |
| `outdir` | Path to the directory where SLICER will save all output files. If it doesn't exist, SLICER will attempt to create it. | `/results/slicer_run1/` | Yes |
| `setname` | A prefix string that will be used for naming output files. Helps in organizing multiple runs. | `experiment_A` | Yes |
| `mapref` | (Optional) Path to a FASTA file containing reference sequences for mapping. If provided, SLICER will perform reference-based analysis. | `/refs/construct_references.fasta` | No |
| `autoref` | (Optional) Specifies the de novo reference prediction method if `mapref` is NOT provided. Options: `slope` or `distance`. Default: `slope`. | `distance` | No |
| `thread` | Number of CPU cores SLICER should use for parallel processing tasks. | `16` | No (default: 1) |
| `config` | [IMPORTANT] Defines the structural configuration of the library being analyzed. Currently, SLICER primarily supports one main configuration (Design 1 from manuscript Figure 1). Set to `1`. Future versions might support more designs. | `1` | Yes |
| `lfs_sequence` | The full sequence of the Left Flanking Sequence (LFS) that is expected immediately upstream of your barcode. | `GTAATGTGGAAAGGACGAAACACCGCACCG` | Yes |
| `rfs_sequence` | The full sequence of the Right Flanking Sequence (RFS) that is expected between your barcode and core sequence. | `GTTTCTTGAAAAAGTGGCACCGAGTCGGTA` | Yes |
| `rbs_sequence` | The full sequence (or at least the starting portion) of the Right Backbone Sequence (RBS) expected immediately downstream of your core sequence. | `AGGAGCCACCATGGCCCCAAAGAAGAAGCG` | Yes |
| `slen` | The length (in base pairs) of the short anchor motifs SLICER will search for at the ends/starts of the `lfs_sequence`, `rfs_sequence`, and `rbs_sequence`. This value defines the exact match length for `LFS_end`, `RFS_start`, `RFS_end`, and `RBS_start`. | `20` | Yes |
| `min_barcode_len_factor` | (Optional) Factor to determine minimum barcode length relative to mode (mode * factor). Default: `0.75`. | `0.8` | No |
| `max_barcode_len_factor` | (Optional) Factor to determine maximum barcode length relative to mode (mode * factor). Default: `1.5`. | `1.2` | No |
| `mapq_threshold` | (Optional) Minimum MAPQ score to retain reads after alignment. Reads below this are discarded. Default: `0` (no filtering). | `20` | No |
| `hamming_dist_barcode` | (Optional) Max Hamming distance for barcode clustering in "distance" de novo method. Default: `1` or `2` (verify default). | `1` | No |
| `levenshtein_dist_core` | (Optional) Max Levenshtein distance for core sequence clustering in "distance" de novo method. Default: `5` or `10` (verify default). | `5` | No |
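
To illustrate how `min_barcode_len_factor` and `max_barcode_len_factor` relate to the mode of the barcode lengths (mode * factor), here is a minimal Python sketch. It is not SLICER's code: the function names are hypothetical, and whether SLICER rounds the bounds or treats them as inclusive is an assumption made here for illustration.

```python
from collections import Counter


def barcode_length_bounds(barcode_lengths, min_factor=0.75, max_factor=1.5):
    """Return (mode, min_len, max_len) for a list of observed barcode lengths.

    The factor defaults mirror the documented defaults; the bounds are left
    unrounded because the rounding rule is not documented.
    """
    if not barcode_lengths:
        raise ValueError("no barcode lengths supplied")
    # most_common(1) gives the most frequent length; ties go to the first seen.
    mode_len = Counter(barcode_lengths).most_common(1)[0][0]
    return mode_len, mode_len * min_factor, mode_len * max_factor


def within_bounds(length, bounds):
    """True if `length` lies inside the bounds (inclusive, by assumption)."""
    _, shortest, longest = bounds
    return shortest <= length <= longest


lengths = [20, 20, 20, 19, 21, 20, 35, 8]
bounds = barcode_length_bounds(lengths)
kept = [n for n in lengths if within_bounds(n, bounds)]
print(bounds)  # mode 20 with the default factors gives bounds 15.0 and 30.0
print(kept)
```

With a mode of 20 and the default factors, lengths of 8 and 35 fall outside the 15 to 30 window and are dropped in this sketch.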

Understanding Anchor Sequences and `slen`:

SLICER identifies key regions in your reads based on four anchor points derived from the sequences you provide:

LFS_end: The last slen bases of your lfs_sequence.
- Example: If lfs_sequence is AGCTAGCTAGCTAGCTAGCT and slen is 5, LFS_end will be TAGCT.
RFS_start: The first slen bases of your rfs_sequence.
- Example: If rfs_sequence is GATCGATCGATCGATCGATC and slen is 5, RFS_start will be GATCG.
RFS_end: The last slen bases of your rfs_sequence.
- Example: If rfs_sequence is GATCGATCGATCGATCGATC and slen is 5, RFS_end will be CGATC.
RBS_start: The first slen bases of your rbs_sequence.
- Example: If rbs_sequence is TCGATCGATCGATCGATCGA and slen is 5, RBS_start will be TCGAT.

The region between LFS_end and RFS_start is considered the Barcode. The region between RFS_end and RBS_start is considered the Core Sequence.
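
The anchor definitions above can be sketched in Python as simple string slices, followed by an exact-match search for the barcode and core regions. This is an illustrative sketch only, not SLICER's implementation: `derive_anchors` and `extract_between` are hypothetical helper names, the toy read is constructed by hand, and reverse-complement reads and sequencing errors are ignored.

```python
def derive_anchors(lfs, rfs, rbs, slen):
    """Slice the four anchor motifs out of the flanking sequences."""
    return {
        "LFS_end": lfs[-slen:],    # last slen bases of lfs_sequence
        "RFS_start": rfs[:slen],   # first slen bases of rfs_sequence
        "RFS_end": rfs[-slen:],    # last slen bases of rfs_sequence
        "RBS_start": rbs[:slen],   # first slen bases of rbs_sequence
    }


def extract_between(read, left, right):
    """Return the bases between the first exact `left` match and the next
    `right` match after it, or None if either anchor is missing."""
    i = read.find(left)
    if i == -1:
        return None
    start = i + len(left)
    j = read.find(right, start)
    if j == -1:
        return None
    return read[start:j]


# Flanking sequences taken from the parameter table examples.
lfs = "GTAATGTGGAAAGGACGAAACACCGCACCG"
rfs = "GTTTCTTGAAAAAGTGGCACCGAGTCGGTA"
rbs = "AGGAGCCACCATGGCCCCAAAGAAGAAGCG"
anchors = derive_anchors(lfs, rfs, rbs, slen=20)

# Hand-built toy read: LFS + barcode + RFS + core + RBS.
read = lfs + "ACGTACGTTGCA" + rfs + "TTGACCGGTAAC" + rbs
barcode = extract_between(read, anchors["LFS_end"], anchors["RFS_start"])
core = extract_between(read, anchors["RFS_end"], anchors["RBS_start"])
print(barcode, core)
```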

Figure: SLICER Library Designs

Ensure that slen is not longer than any of the provided lfs_sequence, rfs_sequence, or rbs_sequence. The slen should be long enough to be unique but short enough to be consistently found despite sequencing errors near the ends of elements.
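
A quick pre-flight check for these two constraints could look like the following sketch. The helper names are hypothetical, and counting how often an anchor recurs inside its own flanking sequence is only a rough proxy for uniqueness; it says nothing about how often the motif appears in real reads.

```python
def validate_slen(slen, sequences):
    """Return a list of problems; an empty list means slen fits every sequence."""
    problems = []
    if slen <= 0:
        problems.append(f"slen must be positive, got {slen}")
    for name, seq in sequences.items():
        if slen > len(seq):
            problems.append(f"slen ({slen}) exceeds length of {name} ({len(seq)})")
    return problems


def count_overlapping(text, motif):
    """Count occurrences of `motif` in `text`, including overlapping ones."""
    count = 0
    start = text.find(motif)
    while start != -1:
        count += 1
        start = text.find(motif, start + 1)
    return count


sequences = {
    "lfs_sequence": "GTAATGTGGAAAGGACGAAACACCGCACCG",
    "rfs_sequence": "GTTTCTTGAAAAAGTGGCACCGAGTCGGTA",
    "rbs_sequence": "AGGAGCCACCATGGCCCCAAAGAAGAAGCG",
}
print(validate_slen(20, sequences))  # fits: all three sequences are 30 bp
print(validate_slen(40, sequences))  # too long for every sequence

# A short anchor taken from a repetitive sequence recurs several times.
repetitive = "AGCTAGCTAGCTAGCTAGCT"
print(count_overlapping(repetitive, repetitive[-5:]))
```

An anchor that recurs within its own flanking sequence, as in the repetitive toy case, suggests that a larger slen is worth trying.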

Example Configuration File (`config.tsv`):

input	/path/to/pacbio_data.ubam
outdir	/output/my_slicer_run/
setname	my_sample
lfs_sequence	GTAATGTGGAAAGGACGAAACACCGCACCGTAGCATGACGACTGACGATCG
rfs_sequence	ACTGCTAGCATCGATCGATCGATCGATCGTTTCTTGAAAAAGTGGCACCGAGTCGGTAGACTAGCATGCATGCA
rbs_sequence	AGGAGCCACCATGGCCCCAAAGAAGAAGCGACTAGCTAGCATCGATGCATG
slen	20
thread	16
config	1
# mapref /path/to/references.fa  # Uncomment if using a reference
# autoref distance              # Uncomment to use distance method if no mapref
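
Before launching a run, a file like this can be sanity-checked with a few lines of Python. The sketch below is not SLICER's own parser: `parse_config` and `missing_required` are hypothetical helpers, the required-key list is copied from the parameter table, and treating lines that start with `#` as comments is an assumption based on the example above.

```python
REQUIRED = (
    "input", "outdir", "setname", "config",
    "lfs_sequence", "rfs_sequence", "rbs_sequence", "slen",
)


def parse_config(text):
    """Parse tab-delimited 'name<TAB>value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("\t")
        params[key.strip()] = value.strip()
    return params


def missing_required(params):
    """List required parameter names that are absent from the parsed config."""
    return [key for key in REQUIRED if key not in params]


example = "\n".join([
    "input\t/path/to/pacbio_data.ubam",
    "slen\t20",
    "# mapref /path/to/references.fa  # Uncomment if using a reference",
])
params = parse_config(example)
print(params)
print(missing_required(params))
```

To check a real file, read it with `open(path).read()` and pass the text to `parse_config`; all values stay as strings, so convert `slen` and `thread` with `int()` where needed.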