2.3 Post‐Processing: SGSeq Parser - labbces/SpliceScape GitHub Wiki

This is the final stage, executed by the merging_outputs.py script. The primary goal of this step is to create a single, unified SQLite database that consolidates the results from all preceding splicing analyses.

The script first defines the final database structure, creating two linked tables: splicing_events and sample_info. It then iterates over the sample accessions and calls the appropriate parser for each one (currently majiq_parser) to populate these tables. The result is a relational database that can be queried with standard SQL and holds all splicing information generated by the pipeline.
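The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual code of merging_outputs.py: the helper names `parse_accessions` and `merge_samples`, and the callback signature `parser(conn, sra_id, species)`, are assumptions made for the example, since the real signature of majiq_parser is not documented on this page.

```python
import sqlite3
from typing import Callable, Iterable, List


def parse_accessions(lines: Iterable[str]) -> List[str]:
    """Return one accession per non-empty line, with whitespace stripped."""
    return [line.strip() for line in lines if line.strip()]


def merge_samples(
    conn: sqlite3.Connection,
    accessions: Iterable[str],
    parser: Callable[[sqlite3.Connection, str, str], None],
    species: str,
) -> int:
    """Call the parser once per sample and return the number of samples processed.

    `parser` stands in for a tool-specific parser such as majiq_parser;
    its signature here is hypothetical.
    """
    processed = 0
    for sra_id in accessions:
        parser(conn, sra_id, species)
        processed += 1
    conn.commit()  # persist everything the parser wrote
    return processed
```

Passing the parser as a callback keeps the loop independent of the tool that produced the results, which is one way a second parser could later be added without changing the loop.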

Inputs and Outputs

Type	Description
Input	• The main directory containing the MAJIQ voila output files. • A text file listing all SRA accessions that were processed. • The species name.
Output	A single SQLite database file (.db) containing the final, consolidated results.

Final Database Schema

The script generates the definitive database with the following two tables:

splicing_events Table

This table stores unique information about each splicing event identified across all samples.

Column	Description
event_id	A unique identifier for the specific splicing event.
search_id	A simplified identifier used for linking tables.
gene_name	The name of the gene where the event occurs.
seqid	The chromosome or scaffold ID.
strand	The strand of the gene (+ or -).
event_type	The type of event (e.g., ALTA, EX, IR).
start, end	The start and end coordinates of the event's primary junction.
coord	A simplified coordinate string (seqid:start-end).
full_coord	The full coordinate string of the event, compatible with genome browsers.
species	The name of the species.

sample_info Table

This table stores the quantitative data for each event on a per-sample basis, linking back to the splicing_events table.

Column	Description
event_id	Foreign key linking to the splicing_events table.
de_novo	Indicates whether the event was discovered de novo (1) or comes from the existing annotation (0).
mean_psi_majiq	The Percent Spliced-In (PSI) value for the event as calculated by MAJIQ.
psi_sgseq	The PSI value for the event as calculated by SGSeq (to be integrated).
sra_id	The SRA accession ID of the sample.
majiq, sgseq	Flags (1 or 0) indicating which tool detected this event.
species	The name of the species.
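For orientation, the two tables could be declared in SQLite along these lines. The column names come from the tables above; the column types, the primary key, and the foreign-key constraint are assumptions for this sketch and may differ from what merging_outputs.py actually creates.

```python
import sqlite3

# Column names follow the documented schema; types and constraints are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS splicing_events (
    event_id   TEXT PRIMARY KEY,
    search_id  TEXT,
    gene_name  TEXT,
    seqid      TEXT,
    strand     TEXT,
    event_type TEXT,
    start      INTEGER,
    "end"      INTEGER,  -- quoted because END is an SQL keyword
    coord      TEXT,
    full_coord TEXT,
    species    TEXT
);
CREATE TABLE IF NOT EXISTS sample_info (
    event_id       TEXT REFERENCES splicing_events(event_id),
    de_novo        INTEGER,
    mean_psi_majiq REAL,
    psi_sgseq      REAL,
    sra_id         TEXT,
    majiq          INTEGER,
    sgseq          INTEGER,
    species        TEXT
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)
    conn.commit()
```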

Script Arguments

Argument	Function	Required
--db	Path to the final SQLite database file that will be created.	Yes
--voila	Path to the parent directory containing the per-sample MAJIQ output folders.	Yes
--spp	The name of the species being processed.	Yes
--srr_list	Path to a text file containing the SRA accessions (one per line) to process.	Yes
--ref-table	Name of a table in the database (e.g., the metadata table) that sra_id should link to.	No
--ref-column	Name of the column in the reference table to link with sra_id. Defaults to sra_id.	No
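An argument table like the one above maps naturally onto Python's argparse. The sketch below shows one way to declare it; the function name `build_parser` and the help strings are illustrative, and the real script may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line arguments listed in the table above."""
    parser = argparse.ArgumentParser(
        description="Merge splicing results into a single SQLite database."
    )
    parser.add_argument("--db", required=True,
                        help="Path to the SQLite database file to create.")
    parser.add_argument("--voila", required=True,
                        help="Parent directory of the per-sample MAJIQ output folders.")
    parser.add_argument("--spp", required=True,
                        help="Name of the species being processed.")
    parser.add_argument("--srr_list", required=True,
                        help="Text file with one SRA accession per line.")
    # argparse exposes dashed options with underscores: args.ref_table, args.ref_column
    parser.add_argument("--ref-table", default=None,
                        help="Table to link sra_id to (optional).")
    parser.add_argument("--ref-column", default="sra_id",
                        help="Column in the reference table to link with sra_id.")
    return parser
```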

Run Example

```bash
python3 merging_outputs.py --db "final_splicing_database.db" \
                           --voila "/path/to/MAJIQ_results/voila/" \
                           --spp "Arabidopsis_thaliana" \
                           --srr_list "processed_samples.txt"
```
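Once the database has been written, it can be inspected with Python's built-in sqlite3 module. The sketch below joins the two tables on event_id to list the MAJIQ PSI values of one gene across samples. The function name `psi_for_gene` is made up for this example, and the query assumes the column names listed in the schema section.

```python
import sqlite3
from typing import List, Tuple


def psi_for_gene(conn: sqlite3.Connection, gene_name: str) -> List[Tuple]:
    """Return (sra_id, event_type, coord, mean_psi_majiq) rows for one gene."""
    query = """
        SELECT s.sra_id, e.event_type, e.coord, s.mean_psi_majiq
        FROM sample_info AS s
        JOIN splicing_events AS e ON e.event_id = s.event_id
        WHERE e.gene_name = ?
        ORDER BY s.sra_id, e.coord
    """
    # The gene name is passed as a bound parameter, not formatted into the SQL.
    return conn.execute(query, (gene_name,)).fetchall()
```

To use it, open the file produced by the run above with `sqlite3.connect("final_splicing_database.db")` and pass the connection together with a gene name that is present in the database.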