2.3 Post-Processing: SGSeq Parser

This is the final stage, executed by the merging_outputs.py script. The primary goal of this step is to create a single, unified SQLite database that consolidates the results from all preceding splicing analyses.

The script first defines the final database structure, creating two interconnected tables: splicing_events and sample_info. It then iterates through each sample and calls the appropriate parser (currently majiq_parser) to populate these tables. The result is a clean, relational, and queryable database containing all splicing information generated by the pipeline.
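
The per-sample loop described above might look roughly like the following sketch. It is not the code in merging_outputs.py: `read_accessions`, `merge_samples`, and the `parse_sample` callback are hypothetical names, the callback stands in for `majiq_parser` (whose real signature is not documented here), and the one-folder-per-accession layout is an assumption.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def read_accessions(lines: Iterable[str]) -> List[str]:
    """Return the accessions found in `lines`, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


def merge_samples(
    accessions: Iterable[str],
    voila_dir: str,
    species: str,
    parse_sample: Callable[[Path, str, str], None],
) -> List[str]:
    """Call `parse_sample` once per accession and return those processed.

    `parse_sample` is a placeholder for the real parser; it is expected to
    write the sample's events into the database.
    """
    processed = []
    for accession in accessions:
        # Assumed layout: <voila_dir>/<accession>/ holds that sample's output.
        sample_dir = Path(voila_dir) / accession
        parse_sample(sample_dir, accession, species)
        processed.append(accession)
    return processed
```

Passing the parser in as a callback keeps the loop independent of any one tool, which would make it straightforward to add a second parser once SGSeq results are integrated.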


Inputs and Outputs

| Type | Description |
| --- | --- |
| Input | • The main directory containing the MAJIQ voila output files. • A text file listing all SRA accessions that were processed. • The species name. |
| Output | A single SQLite database file (.db) containing the final, consolidated results. |

Final Database Schema

The script generates the definitive database with the following two tables:

splicing_events Table

This table stores unique information about each splicing event identified across all samples.

| Column | Description |
| --- | --- |
| event_id | A unique identifier for the specific splicing event. |
| search_id | A simplified identifier used for linking tables. |
| gene_name | The name of the gene where the event occurs. |
| seqid | The chromosome or scaffold ID. |
| strand | The strand of the gene (+ or -). |
| event_type | The type of event (e.g., ALTA, EX, IR). |
| start, end | The start and end coordinates of the event's primary junction. |
| coord | A simplified coordinate string (seqid:start-end). |
| full_coord | The full coordinate string of the event, compatible with genome browsers. |
| species | The name of the species. |

sample_info Table

This table stores the quantitative data for each event on a per-sample basis, linking back to the splicing_events table.

| Column | Description |
| --- | --- |
| event_id | Foreign key linking to the splicing_events table. |
| de_novo | Indicates if the event was discovered 'de novo' (1) or was from existing annotation (0). |
| mean_psi_majiq | The Percent Spliced-In (PSI) value for the event as calculated by MAJIQ. |
| psi_sgseq | The PSI value for the event as calculated by SGSeq (to be integrated). |
| sra_id | The SRA accession ID of the sample. |
| majiq, sgseq | Flags (1 or 0) indicating which tool detected this event. |
| species | The name of the species. |
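
To make the relationship between the two tables concrete, here is a minimal sketch that builds an equivalent schema with Python's built-in `sqlite3` module and joins the tables on `event_id`. The column names come from the tables above; the column types, the primary key on `event_id`, and the helper names `create_schema` and `psi_for_gene` are assumptions for illustration and may differ from what merging_outputs.py actually creates.

```python
import sqlite3

# Column names follow the documented schema; types and constraints are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS splicing_events (
    event_id   TEXT PRIMARY KEY,
    search_id  TEXT,
    gene_name  TEXT,
    seqid      TEXT,
    strand     TEXT,
    event_type TEXT,
    start      INTEGER,
    "end"      INTEGER,  -- quoted because END is an SQL keyword
    coord      TEXT,
    full_coord TEXT,
    species    TEXT
);
CREATE TABLE IF NOT EXISTS sample_info (
    event_id       TEXT REFERENCES splicing_events(event_id),
    de_novo        INTEGER,
    mean_psi_majiq REAL,
    psi_sgseq      REAL,
    sra_id         TEXT,
    majiq          INTEGER,
    sgseq          INTEGER,
    species        TEXT
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both tables on `conn` and enable foreign-key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


def psi_for_gene(conn: sqlite3.Connection, gene_name: str) -> list:
    """Return (sra_id, event_type, coord, mean_psi_majiq) rows for one gene."""
    query = """
        SELECT s.sra_id, e.event_type, e.coord, s.mean_psi_majiq
        FROM sample_info AS s
        JOIN splicing_events AS e ON e.event_id = s.event_id
        WHERE e.gene_name = ?
        ORDER BY s.sra_id, e.coord
    """
    return conn.execute(query, (gene_name,)).fetchall()
```

Against a database produced by the pipeline, a call such as `psi_for_gene(sqlite3.connect("final_splicing_database.db"), "<gene name>")` would list the per-sample MAJIQ PSI values for that gene, provided the real schema matches the column names used here.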

Script Arguments

| Argument | Function | Required |
| --- | --- | --- |
| --db | Path to the final SQLite database file that will be created. | Yes |
| --voila | Path to the parent directory containing the per-sample MAJIQ output folders. | Yes |
| --spp | The name of the species being processed. | Yes |
| --srr_list | Path to a text file containing the SRA accessions (one per line) to process. | Yes |
| --ref-table | Optional. The name of a table in the database (e.g., the metadata table) to link the sra_id to. | No |
| --ref-column | Optional. The column name in the reference table to link with sra_id. Defaults to sra_id. | No |
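
For readers who want to see how this interface maps onto code, the following is a sketch of an equivalent `argparse` definition. It mirrors the argument names, required flags, and the `sra_id` default from the table above; the function name `build_parser` and the help strings are illustrative, and the actual script may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Merge splicing results into a single SQLite database."
    )
    parser.add_argument("--db", required=True,
                        help="Path to the SQLite database file to create.")
    parser.add_argument("--voila", required=True,
                        help="Parent directory of the per-sample MAJIQ output folders.")
    parser.add_argument("--spp", required=True,
                        help="Name of the species being processed.")
    parser.add_argument("--srr_list", required=True,
                        help="Text file with one SRA accession per line.")
    # Optional linkage of sra_id to another table already in the database.
    parser.add_argument("--ref-table", default=None,
                        help="Table to link sra_id to (e.g., a metadata table).")
    parser.add_argument("--ref-column", default="sra_id",
                        help="Column in the reference table that matches sra_id.")
    return parser
```

Note that `argparse` exposes hyphenated options with underscores, so `--ref-table` and `--ref-column` are read as `args.ref_table` and `args.ref_column`.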

Run Example

```
python3 merging_outputs.py --db "final_splicing_database.db" \
                           --voila "/path/to/MAJIQ_results/voila/" \
                           --spp "Arabidopsis_thaliana" \
                           --srr_list "processed_samples.txt"
```