1.0 SpliceScape Reads Processing - labbces/SpliceScape GitHub Wiki

SpliceScape

SpliceScape is a bioinformatics pipeline designed for the large-scale identification and characterization of splicing events from RNA-Seq data. Built using the Nextflow workflow orchestrator, it provides an efficient, reproducible, and scalable solution for generating comprehensive splicing landscapes of an organism.

The pipeline automates all critical steps of splicing analysis, from raw data processing to the final characterization of events. It integrates state-of-the-art tools to ensure high accuracy and performance, including:

Data Cleaning: BBDuK
Splicing-Aware Alignment: STAR
Splicing Event Identification and Quantification: MAJIQ and SGSeq

This approach allows SpliceScape to handle large public datasets, making it a powerful tool for comparative transcriptomics and for exploring the dynamics of RNA processing.

📊 The diagram below provides an overview of the SpliceScape workflow, from the initial input files to the final results.

✅ The pipeline is divided into distinct stages:

Pre-processing: Metadata acquisition and filtering.
Core Pipeline: Downloading reads, quality control (BBDuK), genome indexing (STAR), mapping (STAR), and splicing analysis (MAJIQ & SGSeq).
Post-processing: Parsing and integrating results into a unified database.

Quick Start

Dependencies

To run SpliceScape, you will need the following main dependencies installed:

Core Software & Tools

Nextflow (v24.10.5 or later)
STAR
BBMap (provides BBDuK v35.85)
Samtools
MAJIQ
ffq (installable with pip)
wget & md5sum (Standard Linux/macOS utilities)
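
Before configuring the pipeline, it can help to confirm that the command-line tools listed above are reachable. The following is a minimal sketch for a POSIX shell, not part of SpliceScape itself. The function name `missing_tools` is made up for this example, and the executable names (`STAR`, `samtools`, and so on) are assumed to be the usual ones for each tool, so adjust them to your installation. BBDuK and MAJIQ are left out because the configuration file described under "Running" takes explicit paths for them.

```shell
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names are assumptions; change them if your installation differs.
missing=$(missing_tools nextflow STAR samtools ffq wget md5sum)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    echo "All listed tools were found on PATH."
fi
```

The function only tests whether a command can be located; it does not check versions.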

Programming Environments

Python (v3.7 or higher)
R (v4.4)

Required Python Libraries

requests (v2.28.1)
beautifulsoup4 (via the bs4 package, v0.0.1)
biopython (v1.79)
pandas (v1.4.0)

Required R Libraries

optparse (v1.7.5)
SGSeq (v1.38.0)
GenomicFeatures (v1.56.0)
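
A quick way to see whether the Python and R libraries listed above can be loaded is sketched below. It assumes `python3` and `Rscript` are on PATH; the helper names `missing_py_modules` and `missing_r_packages` are hypothetical and exist only in this example. It checks that each library loads, not which version is installed.

```shell
# Print each Python module from the argument list that fails to import.
missing_py_modules() {
    for mod in "$@"; do
        python3 -c "import $mod" >/dev/null 2>&1 || printf '%s\n' "$mod"
    done
}

# Print each R package from the argument list that fails to load.
missing_r_packages() {
    for pkg in "$@"; do
        Rscript -e "library($pkg)" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

# Import names differ from the package names for two libraries:
# beautifulsoup4 is imported as bs4 and biopython as Bio.
echo "Python modules that could not be imported:"
missing_py_modules requests bs4 Bio pandas
echo "R packages that could not be loaded:"
missing_r_packages optparse SGSeq GenomicFeatures
```

If the R libraries are installed in a non-default location, R may need the `R_LIBS` environment variable set to that directory before this check finds them.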

Input Files

SpliceScape requires three main inputs to start an analysis:

Reference Genome: An uncompressed reference genome file in .fasta format, preferably downloaded from Phytozome.
Genome Annotation: An uncompressed genome annotation file in .gff3 format, also from Phytozome.
Sample List: A plain text (.txt) file containing the SRA accessions for your target samples, with one identifier per line.

For detailed instructions on how to obtain and format these files, please refer to the SpliceScape Wiki.
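
The input requirements above can be checked with a few lines of shell before starting a run. This is a rough sketch under stated assumptions: the helper names are invented for the example, the accession pattern accepts SRR, ERR, and DRR run identifiers (restrict it to SRR if your setup only handles those), and the demo files contain made-up content.

```shell
# Print every line of the sample list that is not a run accession
# (SRR, ERR, or DRR followed by digits). Blank lines are reported as well.
invalid_accessions() {
    grep -vE '^[SED]RR[0-9]+$' "$1"
}

# Succeed if the file starts with '>', as an uncompressed FASTA file does.
starts_like_fasta() {
    [ "$(head -c 1 "$1")" = ">" ]
}

# Demo on throwaway files with made-up content.
tmpdir=$(mktemp -d)
printf 'SRR0000001\nsrr42\nERR0000002\n' > "$tmpdir/SRRs.txt"
printf '>Chr1\nACGT\n' > "$tmpdir/genome.fa"

invalid_accessions "$tmpdir/SRRs.txt"        # prints the malformed line: srr42
starts_like_fasta "$tmpdir/genome.fa" && echo "genome.fa looks like FASTA"
rm -rf "$tmpdir"
```

A compressed genome file fails the `starts_like_fasta` test because its first byte is not `>`, which catches the most common mistake with the first two inputs.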

Running

1. Clone the repository:

git clone https://github.com/labbces/SpliceScape.git
cd SpliceScape

2. Configure the pipeline:

The pipeline is primarily configured using the splicescape_paired.config file. You must edit this file to provide the paths to your input files and executables. Below are the most critical parameters to set:

Section	Name/Variable	Description	Example
	workDir	Directory where Nextflow stores the files generated while SpliceScape runs.	`"/home/Splicescape/results"`
params	outdir	Directory where SpliceScape results are saved (inside workDir).	`"${workDir}/output"`
	species	Species name, which must begin with the Phytozome-style identifier.	`"Athaliana_447"`
	reads_file	File containing SRR identifiers of interest (one per line).	`"/home/Splicescape/data/SRRs.txt"`
	genomeFASTA	Path to genome sequence of interest.	`"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly/Athaliana_447_TAIR10.fa"`
	genomeGFF	Path to genome annotation.	`"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/annotation/Athaliana_447_Araport11.gene_exons.gff3"`
	genome_path	Path to Phytozome assembly directory.	`"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly"`
	genome	Path containing both assembly and annotation.	`"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11"`
	threads	Cores for genomic index generation with STAR.	`12`
	bbduk	Path to BBDuK executable.	`"/home/Splicescape/progs/bbmap_35.85/bbduk2.sh"`
	minlength	Minimum read length for BBDuK cleaning.	`60`
	trimq	Quality threshold for base trimming in sequencing reads (BBDuK).	`20`
	k	k-mer size used by BBDuK to find matches and remove specific sequences.	`27`
	rref	Path to file with sequences to be removed from reads (BBDuK).	`"/home/Splicescape/progs/BBMap/resources/adapters.fa"`
	maxmem	Maximum memory allocated for BBDuK.	`"20g"`
	r_libs	Path where R libraries are installed.	`"/home/Splicescape/R/library"`
	sgseq_cores	Maximum cores used by SGSeq.	`4`
	majiq_path	Path to MAJIQ bin directory.	`"/home/Splicescape/majiq/bin"`
	majiq_cores	Maximum cores used by MAJIQ.	`8`
executor	queueSize	Maximum number of tasks the executor handles in parallel (jobs submitted to the cluster at once).	`50`
Activity Report (Report, timeline, trace)	enabled	Enables report generation.	`true`
	file	Report filenames.	`"report.html"`, `"timeline.html"`, `"trace.txt"`
	overwrite	Allows overwriting files with the same name (use caution with -resume).	`true`
SGE Cluster Profile	executor	Instructs Nextflow to submit each task as a job to the SGE scheduler.	`sge`
	queue	Specifies default SGE queue for job submission.	`splicing.q@cluster`
	clusterOptions	Default options passed directly to SGE's qsub command for each job.	`-S /bin/bash -V -pe smp 2`
	scratch	Enables use of a fast "scratch" directory on the cluster for intermediate files.	`false`
	maxForks	Maximum number of instances of each process that can run concurrently.	`5`
	withName	Overrides default settings for specific processes (resource allocation).	`"DOWNLOAD_READ_FTP { clusterOptions = '-S /bin/bash -pe smp 2 -l h_vmem=2G -V', maxForks = 10 }"`
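
Because most of these parameters are file system paths, a mistyped entry is an easy way to lose time on a failed run. The sketch below shows one way to check them beforehand. The helper `check_paths` is hypothetical, and the paths are simply the values from the Example column, so replace them with the ones you put in your own config file.

```shell
# Report each path that does not exist; return non-zero if any is missing.
check_paths() {
    rc=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            printf 'missing: %s\n' "$p"
            rc=1
        fi
    done
    return "$rc"
}

# Paths copied from the Example column above; replace them with your own.
check_paths \
    "/home/Splicescape/data/SRRs.txt" \
    "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly/Athaliana_447_TAIR10.fa" \
    "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/annotation/Athaliana_447_Araport11.gene_exons.gff3" \
    "/home/Splicescape/progs/bbmap_35.85/bbduk2.sh" \
    "/home/Splicescape/progs/BBMap/resources/adapters.fa" \
    "/home/Splicescape/majiq/bin" \
    || echo "Fix the paths listed above in splicescape_paired.config before running."
```

The check only confirms that each path exists. It does not read the config file, so the list has to be kept in step with it by hand.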

3. Run SpliceScape:

Execute the pipeline using the nextflow run command. If you are using a cluster with a scheduler like SGE, you can use a profile.

nextflow run splicescape_paired.nf -c splicescape_paired.config -profile sge -resume

You can find these files in reads_processing.

See the next pages for more detailed information on each SpliceScape process.