1.0 SpliceScape Reads Processing
SpliceScape
SpliceScape is a bioinformatics pipeline designed for the large-scale identification and characterization of splicing events from RNA-Seq data. Built using the Nextflow workflow orchestrator, it provides an efficient, reproducible, and scalable solution for generating comprehensive splicing landscapes of an organism.
The pipeline automates all critical steps of splicing analysis, from raw data processing to the final characterization of events. It integrates state-of-the-art tools to ensure high accuracy and performance, including:
- Data Cleaning: BBDuK
- Splicing-Aware Alignment: STAR
- Splicing Event Identification and Quantification: MAJIQ and SGSeq
This approach allows SpliceScape to handle large public datasets, making it a powerful tool for comparative transcriptomics and for exploring the dynamics of RNA processing.
📊 The diagram below provides an overview of the SpliceScape workflow, from the initial input files to the final results.

✅ The pipeline is divided into distinct stages:
- Pre-processing: Metadata acquisition and filtering.
- Core Pipeline: Downloading reads, quality control (BBDuK), genome indexing (STAR), mapping (STAR), and splicing analysis (MAJIQ & SGSeq).
- Post-processing: Parsing and integrating results into a unified database.
Quick Start
Dependencies
To run SpliceScape, you will need the following main dependencies installed:
Core Software & Tools
- Nextflow (v24.10.5 or later)
- STAR
- BBMap (provides BBDuK v35.85)
- Samtools
- MAJIQ
- ffq (from pip)
- wget & md5sum (Standard Linux/macOS utilities)
Programming Environments
- Python (v3.7 or higher)
- R (v4.4)
Required Python Libraries
- requests (v2.28.1)
- beautifulsoup4 (v0.0.1)
- biopython (v1.79)
- pandas (v1.4.0)
Required R Libraries
- optparse (v1.7.5)
- SGSeq (v1.38.0)
- GenomicFeatures (v1.56.0)
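Before configuring the pipeline, it can help to confirm that the command-line tools listed above are reachable. The snippet below is a minimal sketch, not part of SpliceScape: the function name `missing_tools` is hypothetical, and the executable names (`STAR`, `Rscript`, `python3`, and so on) are assumptions about a typical installation. BBDuK and MAJIQ are left out because the configuration file takes explicit paths for them.

```shell
#!/bin/sh
# Print the name of every command that cannot be found on PATH.
# (missing_tools is a hypothetical helper, not a SpliceScape script.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names are assumptions; adjust them to match your installation.
missing=$(missing_tools nextflow STAR samtools ffq wget md5sum python3 Rscript)

if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
else
    echo "All listed tools were found on PATH."
fi
```

This only checks that each tool can be found; it does not verify versions or the Python and R libraries.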
Input Files
SpliceScape requires three main inputs to start an analysis:
- Reference Genome: An uncompressed reference genome file in .fasta format, preferably downloaded from Phytozome.
- Genome Annotation: An uncompressed genome annotation file in .gff3 format, also from Phytozome.
- Sample List: A plain text (.txt) file containing the SRA accessions for your target samples, with one identifier per line.
For detailed instructions on how to obtain and format these files, please refer to the SpliceScape Wiki.
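A quick sanity check of the sample list can catch stray whitespace, blank lines, or non-accession text before any download starts. The sketch below assumes that run accessions look like `SRR`, `ERR`, or `DRR` followed by digits; that pattern, the function name `invalid_accessions`, and the default file name `SRRs.txt` are illustrative assumptions rather than rules taken from the pipeline.

```shell
#!/bin/sh
# Print every line of the sample list that does not look like a run accession.
# The accepted pattern (SRR/ERR/DRR + digits) is an assumption, not a pipeline rule.
invalid_accessions() {
    grep -v -E '^[SED]RR[0-9]+$' "$1"
}

sample_list="${1:-SRRs.txt}"   # hypothetical default file name

if [ -f "$sample_list" ]; then
    # grep -v exits with status 0 only when at least one non-matching line was found.
    if bad=$(invalid_accessions "$sample_list"); then
        printf 'Lines that do not look like run accessions:\n%s\n' "$bad" >&2
    else
        echo "All lines look like run accessions."
    fi
fi
```

Because the pattern is anchored at both ends, blank lines and lines with trailing spaces or Windows line endings are reported as well.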
Running
1. Clone the repository:
git clone https://github.com/labbces/SpliceScape.git
cd SpliceScape
2. Configure the pipeline:
The pipeline is primarily configured using the splicescape_paired.config file. You must edit this file to provide the paths to your input files and executables. Below are the most critical parameters to set:
| Section | Name/Variable | Description | Example |
|---|---|---|---|
| workDir | | Where files generated by SpliceScape will be saved. | "/home/Splicescape/results" |
| params | outdir | Where SpliceScape results will be saved (inside workDir). | "${workDir}/output" |
| params | species | Species name. Must at least start with the Phytozome naming convention. | "Athaliana_447" |
| params | reads_file | File containing SRR identifiers of interest (one per line). | "/home/Splicescape/data/SRRs.txt" |
| params | genomeFASTA | Path to genome sequence of interest. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly/Athaliana_447_TAIR10.fa" |
| params | genomeGFF | Path to genome annotation. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/annotation/Athaliana_447_Araport11.gene_exons.gff3" |
| params | genome_path | Path to Phytozome assembly directory. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly" |
| params | genome | Path containing both assembly and annotation. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11" |
| params | threads | Cores for genome index generation with STAR. | 12 |
| params | bbduk | Path to BBDuK executable. | "/home/Splicescape/progs/bbmap_35.85/bbduk2.sh" |
| params | minlength | Minimum read length for BBDuK cleaning. | 60 |
| params | trimq | Quality threshold for base trimming in sequencing reads (BBDuK). | 20 |
| params | k | k-mer size used by BBDuK to find matches and remove specific sequences. | 27 |
| params | rref | Path to file with sequences to be removed from reads (BBDuK). | "/home/Splicescape/progs/BBMap/resources/adapters.fa" |
| params | maxmem | Maximum memory allocated for BBDuK. | "20g" |
| params | r_libs | Path where R libraries are installed. | "/home/Splicescape/R/library" |
| params | sgseq_cores | Maximum cores used by SGSeq. | 4 |
| params | majiq_path | Path to MAJIQ bin directory. | "/home/Splicescape/majiq/bin" |
| params | majiq_cores | Maximum cores used by MAJIQ. | 8 |
| executor | queueSize | Maximum number of jobs that can be simultaneously submitted to a cluster. | 50 |
| Activity Report (Report, timeline, trace) | enabled | Enables report generation. | true |
| Activity Report (Report, timeline, trace) | file | Report filenames. | "report.html", "timeline.html", "trace.txt" |
| Activity Report (Report, timeline, trace) | overwrite | Allows overwriting files with the same name (use caution with -resume). | true |
| SGE Cluster Profile | executor | Instructs Nextflow to submit each task as a "job" to the SGE scheduler. | sge |
| SGE Cluster Profile | queue | Specifies the default SGE queue for job submission. | splicing.q@cluster |
| SGE Cluster Profile | clusterOptions | Default options passed directly to SGE's qsub command for each job. | -S /bin/bash -V -pe smp 2 |
| SGE Cluster Profile | scratch | Enables use of a fast "scratch" directory on the cluster for intermediate files. | false |
| SGE Cluster Profile | maxForks | Limits the number of concurrent instances of each process type. | 5 |
| SGE Cluster Profile | withName | Overrides default settings for specific processes (resource allocation). | "DOWNLOAD_READ_FTP { clusterOptions = '-S /bin/bash -pe smp 2 -l h_vmem=2G -V', maxForks = 10 }" |
3. Run SpliceScape:
Execute the pipeline using the nextflow run command. If you are using a cluster with a scheduler like SGE, you can use a profile.
nextflow run splicescape_paired.nf -c splicescape_paired.config -profile sge -resume
You can find these files in reads_processing.
See the next pages for more detailed information on each SpliceScape process.