Input Files and Configuration - mbassalbioinformatics/SLICER GitHub Wiki
Input Files and Configuration
SLICER requires specific input files and parameters to run correctly. These can be provided via a configuration file or directly as command-line arguments.
1. Sequencing Data File
- Format: SLICER accepts PacBio/ONT long-read sequencing data in either:
  - Unsorted BAM format (`.ubam`)
  - Compressed FASTQ format (`.fq.gz`)
- Content: This file should contain the raw or HiFi reads from your PacBio sequencing run.
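A quick pre-flight check on the input file can catch a wrong path or format before a long run. The sketch below is not part of SLICER: the function name `classify_input` is hypothetical, and it only inspects the file name and, for `.fq.gz` files, the two-byte gzip magic number.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def classify_input(path):
    """Return 'ubam' or 'fq.gz' for a supported input name, else raise ValueError.

    Hypothetical helper for illustration; not part of SLICER.
    """
    p = Path(path)
    name = p.name.lower()
    if name.endswith(".ubam"):
        return "ubam"
    if name.endswith(".fq.gz"):
        # Confirm the file really is gzip-compressed, not just named that way.
        with p.open("rb") as handle:
            if handle.read(2) != GZIP_MAGIC:
                raise ValueError(f"{p} has a .fq.gz name but is not gzip-compressed")
        return "fq.gz"
    raise ValueError(f"{p}: expected a .ubam or .fq.gz file")
```

The check is deliberately shallow. It does not validate BAM contents or FASTQ record structure.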
2. Configuration File (Recommended)
Using a configuration file is the recommended way to run SLICER, especially for complex analyses or to ensure reproducibility. The configuration file is a tab-delimited text file, saved as either `.tsv` or `.txt`.
- Argument: Specify the path to this file using the `--arglist /path/to/your_config.tsv` or `--config_file /path/to/your_config.txt` option when running SLICER.
Configuration File Parameters:
| Parameter | Description | Example Value | Required? |
|---|---|---|---|
| `input` | Path to the input PacBio sequencing file (`.ubam` or `.fq.gz`). | `/data/pacbio_reads.fq.gz` | Yes |
| `outdir` | Path to the directory where SLICER will save all output files. If it doesn't exist, SLICER will attempt to create it. | `/results/slicer_run1/` | Yes |
| `setname` | A prefix string that will be used for naming output files. Helps in organizing multiple runs. | `experiment_A` | Yes |
| `mapref` | (Optional) Path to a FASTA file containing reference sequences for mapping. If provided, SLICER will perform reference-based analysis. | `/refs/construct_references.fasta` | No |
| `autoref` | (Optional) Specifies the de novo reference prediction method if `mapref` is NOT provided. Options: `slope` or `distance`. Default: `slope`. | `distance` | No |
| `thread` | Number of CPU cores SLICER should use for parallel processing tasks. | `16` | No (default: 1) |
| `config` | [IMPORTANT] Defines the structural configuration of the library being analyzed. Currently, SLICER primarily supports one main configuration (Design 1 from manuscript Figure 1). Set to `1`. Future versions might support more designs. | `1` | Yes |
| `lfs_sequence` | The full sequence of the Left Flanking Sequence (LFS) that is expected immediately upstream of your barcode. | `GTAATGTGGAAAGGACGAAACACCGCACCG` | Yes |
| `rfs_sequence` | The full sequence of the Right Flanking Sequence (RFS) that is expected between your barcode and core sequence. | `GTTTCTTGAAAAAGTGGCACCGAGTCGGTA` | Yes |
| `rbs_sequence` | The full sequence (or at least the starting portion) of the Right Backbone Sequence (RBS) expected immediately downstream of your core sequence. | `AGGAGCCACCATGGCCCCAAAGAAGAAGCG` | Yes |
| `slen` | The length (in base pairs) of the short anchor motifs SLICER will search for at the ends/starts of the `lfs_sequence`, `rfs_sequence`, and `rbs_sequence`. This value defines the exact match length for `LFS_end`, `RFS_start`, `RFS_end`, and `RBS_start`. | `20` | Yes |
| `min_barcode_len_factor` | (Optional) Factor to determine minimum barcode length relative to mode (mode * factor). Default: `0.75`. | `0.8` | No |
| `max_barcode_len_factor` | (Optional) Factor to determine maximum barcode length relative to mode (mode * factor). Default: `1.5`. | `1.2` | No |
| `mapq_threshold` | (Optional) Minimum MAPQ score to retain reads after alignment. Reads below this are discarded. Default: `0` (no filtering). | `20` | No |
| `hamming_dist_barcode` | (Optional) Max Hamming distance for barcode clustering in "distance" de novo method. Default: 1 or 2 (verify default). | `1` | No |
| `levenshtein_dist_core` | (Optional) Max Levenshtein distance for core sequence clustering in "distance" de novo method. Default: 5 or 10 (verify default). | `5` | No |
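Before launching a run, it can help to confirm that a configuration file contains every required parameter. The following is a minimal sketch, assuming a two-column layout of parameter name and value separated by a tab, with `#` marking commented-out lines. The helper `read_config` and the `REQUIRED` tuple are hypothetical names; SLICER's own parser may treat comments or blank lines differently.

```python
import csv

# Parameters marked "Yes" in the Required? column of the parameter table.
REQUIRED = ("input", "outdir", "setname", "config",
            "lfs_sequence", "rfs_sequence", "rbs_sequence", "slen")


def read_config(path):
    """Parse a two-column, tab-separated config file into a dict of strings.

    Hypothetical helper for illustration; not SLICER's own parser.
    """
    params = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and lines commented out with '#'.
            if not row or not row[0].strip() or row[0].lstrip().startswith("#"):
                continue
            if len(row) < 2:
                raise ValueError(f"no tab-separated value for {row[0]!r}")
            params[row[0].strip()] = row[1].strip()
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return params
```

All values come back as strings, so numeric parameters need an explicit conversion, for example `int(params["slen"])`.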
Understanding Anchor Sequences and `slen`:
SLICER identifies key regions in your reads based on four anchor points derived from the sequences you provide:
- `LFS_end`: The last `slen` bases of your `lfs_sequence`.
  - Example: If `lfs_sequence` is `AGCTAGCTAGCTAGCTAGCT` and `slen` is `5`, `LFS_end` will be `TAGCT`.
- `RFS_start`: The first `slen` bases of your `rfs_sequence`.
  - Example: If `rfs_sequence` is `GATCGATCGATCGATCGATC` and `slen` is `5`, `RFS_start` will be `GATCG`.
- `RFS_end`: The last `slen` bases of your `rfs_sequence`.
  - Example: If `rfs_sequence` is `GATCGATCGATCGATCGATC` and `slen` is `5`, `RFS_end` will be `CGATC`.
- `RBS_start`: The first `slen` bases of your `rbs_sequence`.
  - Example: If `rbs_sequence` is `TCGATCGATCGATCGATCGA` and `slen` is `5`, `RBS_start` will be `TCGAT`.
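The four anchor definitions amount to simple string slices. The sketch below illustrates them in Python; the function name `derive_anchors` is hypothetical, and the code only mirrors the definitions given here rather than SLICER's internal implementation.

```python
def derive_anchors(lfs_sequence, rfs_sequence, rbs_sequence, slen):
    """Return the four anchor motifs as a dict keyed by anchor name.

    Hypothetical helper for illustration; not part of SLICER.
    """
    if slen < 1:
        # A slice like seq[-0:] would return the whole sequence, so reject it.
        raise ValueError("slen must be a positive integer")
    return {
        "LFS_end": lfs_sequence[-slen:],    # last slen bases of the LFS
        "RFS_start": rfs_sequence[:slen],   # first slen bases of the RFS
        "RFS_end": rfs_sequence[-slen:],    # last slen bases of the RFS
        "RBS_start": rbs_sequence[:slen],   # first slen bases of the RBS
    }


anchors = derive_anchors(
    "AGCTAGCTAGCTAGCTAGCT",
    "GATCGATCGATCGATCGATC",
    "TCGATCGATCGATCGATCGA",
    5,
)
print(anchors)
# {'LFS_end': 'TAGCT', 'RFS_start': 'GATCG', 'RFS_end': 'CGATC', 'RBS_start': 'TCGAT'}
```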
The region between `LFS_end` and `RFS_start` is considered the Barcode.
The region between `RFS_end` and `RBS_start` is considered the Core Sequence.
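To make the two region definitions concrete, here is a minimal sketch that pulls the Barcode and Core Sequence out of a single read. It assumes exact anchor matches on the forward strand and takes the first occurrence of each anchor in left-to-right order; SLICER's actual search strategy (for example, how it handles reverse-complement reads or mismatches) is not described here. The function name `extract_barcode_and_core` is hypothetical.

```python
def extract_barcode_and_core(read, anchors):
    """Return (barcode, core) from a read, or None if any anchor is not found.

    `anchors` maps 'LFS_end', 'RFS_start', 'RFS_end', 'RBS_start' to motifs.
    Hypothetical helper for illustration; exact forward-strand matching only.
    """
    lfs_at = read.find(anchors["LFS_end"])
    if lfs_at == -1:
        return None
    barcode_start = lfs_at + len(anchors["LFS_end"])

    # The barcode ends where RFS_start begins.
    rfs_start_at = read.find(anchors["RFS_start"], barcode_start)
    if rfs_start_at == -1:
        return None

    # The core begins right after RFS_end.
    rfs_end_at = read.find(anchors["RFS_end"], rfs_start_at)
    if rfs_end_at == -1:
        return None
    core_start = rfs_end_at + len(anchors["RFS_end"])

    # The core ends where RBS_start begins.
    rbs_at = read.find(anchors["RBS_start"], core_start)
    if rbs_at == -1:
        return None

    return read[barcode_start:rfs_start_at], read[core_start:rbs_at]
```

Because the search takes the first match, a repetitive anchor can be found too early and shift the extracted regions, which is one reason anchors need to be unique.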
Ensure that `slen` is not longer than any of the provided `lfs_sequence`, `rfs_sequence`, or `rbs_sequence`. The `slen` should be long enough to be unique but short enough to be consistently found despite sequencing errors near the ends of elements.
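Both constraints on `slen` can be checked before a run. The sketch below raises an error when `slen` exceeds any flanking sequence and reports anchors that occur more than once across the three provided sequences. It is an illustration only: the names `count_occurrences` and `check_slen` are hypothetical, and the uniqueness test looks only at the sequences you supply, not at the rest of the construct or the reads.

```python
def count_occurrences(text, motif):
    """Count occurrences of motif in text, including overlapping ones."""
    count, start = 0, text.find(motif)
    while start != -1:
        count += 1
        start = text.find(motif, start + 1)
    return count


def check_slen(slen, lfs_sequence, rfs_sequence, rbs_sequence):
    """Validate slen and return the names of anchors that are not unique.

    Hypothetical helper for illustration; not part of SLICER.
    """
    sequences = {
        "lfs_sequence": lfs_sequence,
        "rfs_sequence": rfs_sequence,
        "rbs_sequence": rbs_sequence,
    }
    if slen < 1:
        raise ValueError("slen must be a positive integer")
    for name, seq in sequences.items():
        if slen > len(seq):
            raise ValueError(f"slen ({slen}) is longer than {name} ({len(seq)} bp)")

    anchors = {
        "LFS_end": lfs_sequence[-slen:],
        "RFS_start": rfs_sequence[:slen],
        "RFS_end": rfs_sequence[-slen:],
        "RBS_start": rbs_sequence[:slen],
    }
    # An anchor is flagged when it appears more than once across all sequences.
    return [
        name for name, motif in anchors.items()
        if sum(count_occurrences(seq, motif) for seq in sequences.values()) > 1
    ]
```

With the short repetitive sequences used in the anchor examples and `slen` set to `5`, every anchor would be flagged as repeated, which shows why real flanking sequences and a longer `slen` are needed in practice.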
Example Configuration File (`config.tsv`):

```
input	/path/to/pacbio_data.ubam
outdir	/output/my_slicer_run/
setname	my_sample
lfs_sequence	GTAATGTGGAAAGGACGAAACACCGCACCGTAGCATGACGACTGACGATCG
rfs_sequence	ACTGCTAGCATCGATCGATCGATCGATCGTTTCTTGAAAAAGTGGCACCGAGTCGGTAGACTAGCATGCATGCA
rbs_sequence	AGGAGCCACCATGGCCCCAAAGAAGAAGCGACTAGCTAGCATCGATGCATG
slen	20
thread	16
config	1
# mapref	/path/to/references.fa # Uncomment if using a reference
# autoref	distance # Uncomment to use distance method if no mapref
```