1.0 SpliceScape Reads Processing - labbces/SpliceScape GitHub Wiki

SpliceScape

SpliceScape is a bioinformatics pipeline designed for the large-scale identification and characterization of splicing events from RNA-Seq data. Built using the Nextflow workflow orchestrator, it provides an efficient, reproducible, and scalable solution for generating comprehensive splicing landscapes of an organism.

The pipeline automates all critical steps of splicing analysis, from raw data processing to the final characterization of events. It integrates state-of-the-art tools to ensure high accuracy and performance, including:

  • Data Cleaning: BBDuK
  • Splicing-Aware Alignment: STAR
  • Splicing Event Identification and Quantification: MAJIQ and SGSeq

This approach allows SpliceScape to handle large public datasets, making it a powerful tool for comparative transcriptomics and for exploring the dynamics of RNA processing.

📊 [Diagram: overview of the SpliceScape workflow, from the initial input files to the final results.]

✅ The pipeline is divided into three distinct stages:

  1. Pre-processing: Metadata acquisition and filtering.
  2. Core Pipeline: Downloading reads, quality control (BBDuK), genome indexing (STAR), mapping (STAR), and splicing analysis (MAJIQ & SGSeq).
  3. Post-processing: Parsing and integrating results into a unified database.

Quick Start

Dependencies

To run SpliceScape, you will need the following main dependencies installed:

Core Software & Tools

  • Nextflow (v24.10.5 or later)
  • STAR
  • BBMap (provides BBDuK v35.85)
  • Samtools
  • MAJIQ
  • ffq (from pip)
  • wget & md5sum (Standard Linux/macOS utilities)
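
Before a first run, it can help to confirm that the command-line tools listed above are reachable on the PATH. The following is a minimal sketch and not part of SpliceScape: the function name missing_tools is hypothetical, and the executable names passed to it are assumptions about how each tool is installed on your system. BBDuK and MAJIQ are left out because the pipeline locates them through the bbduk and majiq_path settings in the config file.

```shell
# Print each tool name (one per line) that cannot be found on the PATH.
# Prints nothing when every tool is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names below are assumptions; adjust them to your installation.
missing_tools nextflow STAR samtools ffq wget md5sum
```

An empty output means every listed executable was found.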

Programming Environments

  • Python (v3.7 or higher)
  • R (v4.4)
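
The minimum versions given for Nextflow and Python can be compared against an installed version with a small helper. This is a sketch under stated assumptions: version_ge is a hypothetical name, it relies on the version sort (sort -V) from GNU coreutils, and obtaining the installed version string is left to you (for example, from python3 --version).

```shell
# Succeed (exit status 0) when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example values only; substitute the versions reported by your own tools.
version_ge "24.10.5" "24.10.5" && echo "Nextflow version is sufficient"
version_ge "3.6.9" "3.7" || echo "Python 3.6.9 is too old"
```

Version sort is used instead of plain string comparison so that, for instance, 3.11 is correctly treated as newer than 3.7.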

Required Python Libraries

  • requests (v2.28.1)
  • beautifulsoup4 (v0.0.1)
  • biopython (v1.79)
  • pandas (v1.4.0)

Required R Libraries

  • optparse (v1.7.5)
  • SGSeq (v1.38.0)
  • GenomicFeatures (v1.56.0)

Input Files

SpliceScape requires three main inputs to start an analysis:

  • Reference Genome: An uncompressed reference genome file in .fasta format, preferably downloaded from Phytozome.
  • Genome Annotation: An uncompressed genome annotation file in .gff3 format, also from Phytozome.
  • Sample List: A plain text (.txt) file containing the SRA accessions for your target samples, with one identifier per line.
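
Because the sample list must hold exactly one accession per line, a quick check before launching can catch stray whitespace, blank lines, or mistyped identifiers. The sketch below is illustrative only: invalid_accessions is a hypothetical helper, the pattern (SRR, ERR, or DRR followed by digits) is an assumption about the run accessions you intend to use, and the accessions in the demo are made-up placeholders.

```shell
# Print, with line numbers, every line of the sample list that is not a
# single run accession. The pattern is an assumption; blank lines are
# reported as well. Prints nothing when the whole file is valid.
invalid_accessions() {
    grep -Evn '^[SED]RR[0-9]+$' "$1" || true
}

# Demo on a temporary file with placeholder accessions.
sample_list="$(mktemp)"
printf 'SRR1234567\nERR000001\nbad id\n' > "$sample_list"
invalid_accessions "$sample_list"   # prints: 3:bad id
rm -f "$sample_list"
```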

For detailed instructions on how to obtain and format these files, please refer to the SpliceScape Wiki.

Running

1. Clone the repository:

git clone https://github.com/labbces/SpliceScape.git
cd SpliceScape

2. Configure the pipeline:

The pipeline is primarily configured using the splicescape_paired.config file. You must edit this file to provide the paths to your input files and executables. Below are the most critical parameters to set:

| Section | Name/Variable | Description | Example |
|---|---|---|---|
| | workDir | Where files generated by SpliceScape will be saved. | "/home/Splicescape/results" |
| params | outdir | Where SpliceScape results will be saved (inside workDir). | "${workDir}/output" |
| params | species | Species name - Must at least start with the Phytozome naming convention. | "Athaliana_447" |
| params | reads_file | File containing SRR identifiers of interest (one per line). | "/home/Splicescape/data/SRRs.txt" |
| params | genomeFASTA | Path to genome sequence of interest. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly/Athaliana_447_TAIR10.fa" |
| params | genomeGFF | Path to genome annotation. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/annotation/Athaliana_447_Araport11.gene_exons.gff3" |
| params | genome_path | Path to Phytozome assembly directory. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly" |
| params | genome | Path containing both assembly and annotation. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11" |
| params | threads | Cores for genomic index generation with STAR. | 12 |
| params | bbduk | Path to BBDuK executable. | "/home/Splicescape/progs/bbmap_35.85/bbduk2.sh" |
| params | minlength | Minimum read length for BBDuK cleaning. | 60 |
| params | trimq | Quality threshold for base trimming in sequencing reads (BBDuK). | 20 |
| params | k | k-mer size used by BBDuK to find matches and remove specific sequences. | 27 |
| params | rref | Path to file with sequences to be removed from reads (BBDuK). | "/home/Splicescape/progs/BBMap/resources/adapters.fa" |
| params | maxmem | Maximum memory allocated for BBDuK. | "20g" |
| params | r_libs | Path where R libraries are installed. | "/home/Splicescape/R/library" |
| params | sgseq_cores | Maximum cores used by SGSeq. | 4 |
| params | majiq_path | Path to MAJIQ bin directory. | "/home/Splicescape/majiq/bin" |
| params | majiq_cores | Maximum cores used by MAJIQ. | 8 |
| executor | queueSize | Maximum number of files that can be simultaneously submitted to a cluster. | 50 |
| Activity Report (Report, timeline, trace) | enabled | Enables report generation. | TRUE |
| Activity Report (Report, timeline, trace) | file | Report filenames. | "report.html", "timeline.html", "trace.txt" |
| Activity Report (Report, timeline, trace) | overwrite | Allows overwriting files with same name (Caution when using -resume). | TRUE |
| SGE Cluster Profile | executor | Instructs Nextflow to submit each task as a "job" to SGE scheduler. | sge |
| SGE Cluster Profile | queue | Specifies default SGE queue for job submission. | splicing.q@cluster |
| SGE Cluster Profile | clusterOptions | Default options passed directly to SGE's qsub command for each job. | -S /bin/bash -V -pe smp 2 |
| SGE Cluster Profile | scratch | Enables use of fast "scratch" directory on cluster for intermediate files. | FALSE |
| SGE Cluster Profile | maxForks | Limits execution to maximum concurrent processes of each type. | 5 |
| SGE Cluster Profile | withName | Overrides default settings for specific processes (resource allocation). | "DOWNLOAD_READ_FTP { clusterOptions = '-S /bin/bash -pe smp 2 -l h_vmem=2G -V', maxForks = 10 }" |
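
Since several of these settings point at files on disk, and the genome and annotation are expected to be uncompressed, a short pre-flight check can surface path mistakes before any jobs are submitted. This is a hedged sketch rather than anything shipped with SpliceScape: check_inputs is a hypothetical helper, and detecting compression by the .gz suffix is a simplifying assumption.

```shell
# Report input paths that are missing or look gzip-compressed.
# Prints one "missing: PATH" or "compressed: PATH" line per problem and
# nothing when every path is an existing file without a .gz suffix.
check_inputs() {
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path"
        else
            case "$path" in
                *.gz) echo "compressed: $path" ;;
            esac
        fi
    done
}

# Usage (pass the same paths you set for genomeFASTA, genomeGFF, and reads_file):
#   check_inputs /path/to/genome.fa /path/to/annotation.gff3 /path/to/SRRs.txt
```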

3. Run SpliceScape:

Execute the pipeline using the nextflow run command. If you are using a cluster with a scheduler like SGE, you can use a profile.

nextflow run splicescape_paired.nf -c splicescape_paired.config -profile sge -resume
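
Because the profile is optional, the command can be assembled with or without it. The wrapper below is a small sketch: splicescape_cmd is a hypothetical name, it only prints the command rather than running it, and whether the pipeline behaves correctly without the sge profile depends on your config and is an assumption here.

```shell
# Print the nextflow command line, adding "-profile NAME" only when a
# profile name is given as the first argument.
splicescape_cmd() {
    cmd="nextflow run splicescape_paired.nf -c splicescape_paired.config"
    if [ -n "${1:-}" ]; then
        cmd="$cmd -profile $1"
    fi
    echo "$cmd -resume"
}

splicescape_cmd sge   # prints the cluster command shown above
splicescape_cmd       # prints the same command without -profile
```

The -resume flag is kept in both forms so that an interrupted run can reuse cached results.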

You can find these files in reads_processing.

! See the next pages for more detailed information on each SpliceScape process.