2.2 Post‐Processing: MAJIQ Parser - labbces/SpliceScape GitHub Wiki

After MAJIQ and VOILA have identified and quantified splicing events, the majiq_parser.py script performs the final step of the MAJIQ workflow. Its primary purpose is to parse the tab-separated (.tsv) output files from voila modulize, apply a more detailed classification logic to the events, and load the structured data into the final SQLite database.

The result is a single, queryable database of all splicing events that links each event to its PSI values and sample information.

Inputs and Outputs

Type	Description
Input	The .voila.tsv files generated by the voila modulize process. Each file contains quantified data for a specific event type (e.g., cassette exons, alternative splice sites).
Output	The script writes no new files. Instead, it populates two tables in the project's final SQLite database: splicing_events and sample_info.
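
As a rough illustration of the input side, the sketch below reads a tab-separated file into one dictionary per row. It is not the parser's actual code: the column names (gene_name, seqid, strand, junction_coord) are hypothetical, and skipping lines that begin with "#" is an assumption about the file layout.

```python
import csv
import io


def read_voila_tsv(handle):
    """Yield one dict per data row, skipping blank lines and '#' comment lines.

    Skipping '#' lines is an assumption about the file layout.
    """
    data_lines = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    yield from csv.DictReader(data_lines, delimiter="\t")


# Hypothetical miniature file; the column names are illustrative only.
example = io.StringIO(
    "# illustrative header comment\n"
    "gene_name\tseqid\tstrand\tjunction_coord\n"
    "GeneA\tchr1\t+\t100-200\n"
)

rows = list(read_voila_tsv(example))
print(rows[0]["gene_name"], rows[0]["junction_coord"])
```

With a real file, the same function could be called on an open file handle, for example inside `with open(path, newline="") as handle:`.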

Core Functionality

The script iterates through each splicing event reported by MAJIQ and performs two key actions:

1. Detailed Event Classification

While MAJIQ groups events into broad categories, this script analyzes the specific coordinates of exons and junctions to assign a more precise event type. Based on the strand and whether junction coordinates match known exon boundaries, it classifies events as:

EX (Exon Skipping): A canonical cassette exon event.
ALTD (Alternative Donor): An alternative 5' splice site event.
ALTA (Alternative Acceptor): An alternative 3' splice site event.
ALTX (Complex Alternative Event): An event involving changes at both the 5' and 3' splice sites.
IR (Intron Retention): An event where an intron is retained in the mature transcript.
EP (Expected Path): The constitutively spliced path (no alternative event).
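
The strand-aware boundary check described above can be sketched as follows. This is a simplified illustration, not the logic of majiq_parser.py: the function name, its arguments, and the rule that a junction with both ends on known boundaries maps to EP are all assumptions, and EX and IR are assumed to be decided elsewhere from MAJIQ's own event categories. Junction coordinates are assumed to coincide exactly with exon boundary coordinates.

```python
def classify_junction(strand, junction_start, junction_end, exon_starts, exon_ends):
    """Assign a label from which junction ends match known exon boundaries.

    In genomic coordinates a junction runs from the end of one exon
    (lower coordinate) to the start of another (higher coordinate).
    On the + strand the lower end is the donor (5' splice site);
    on the - strand the lower end is the acceptor (3' splice site).
    """
    start_known = junction_start in exon_ends
    end_known = junction_end in exon_starts

    if strand == "+":
        donor_known, acceptor_known = start_known, end_known
    elif strand == "-":
        donor_known, acceptor_known = end_known, start_known
    else:
        raise ValueError(f"unexpected strand: {strand!r}")

    if donor_known and acceptor_known:
        return "EP"    # assumption: both ends annotated means expected path
    if donor_known:
        return "ALTA"  # only the acceptor (3' splice site) is novel
    if acceptor_known:
        return "ALTD"  # only the donor (5' splice site) is novel
    return "ALTX"      # both splice sites differ from known boundaries


# Hypothetical annotation: one exon ends at 100, another starts at 300.
exon_starts = {300}
exon_ends = {100}

print(classify_junction("+", 100, 290, exon_starts, exon_ends))
print(classify_junction("-", 100, 290, exon_starts, exon_ends))
```

The two calls use the same coordinates on opposite strands, which shows why the strand has to be taken into account: the unmatched end is the acceptor on the + strand and the donor on the - strand.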

2. Database Loading

The script takes the classified event information, along with the PSI values and sample identifiers, and inserts them as structured rows into the final SQLite database. Each sample-level row carries the identifiers of its event, so every PSI value can be traced back to the event it quantifies.
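
A minimal sketch of this loading step, using Python's standard sqlite3 module, is shown below. It assumes a reduced set of columns, TEXT/REAL/INTEGER column types, and a hypothetical load_event helper; the identifiers and accession strings are placeholders, and the real script may structure its inserts differently.

```python
import sqlite3

# In-memory database for illustration; the real pipeline would open
# the project's database file instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE IF NOT EXISTS splicing_events (
    event_id   TEXT PRIMARY KEY,
    search_id  TEXT,
    gene_name  TEXT,
    event_type TEXT
);
CREATE TABLE IF NOT EXISTS sample_info (
    search_id TEXT,
    event_id  TEXT,
    srr       TEXT,
    mean_psi_per_lsv_junction REAL,
    majiq     INTEGER
);
""")


def load_event(conn, event, samples):
    """Insert one event (once) and its per-sample PSI rows (hypothetical helper)."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.execute(
            "INSERT OR IGNORE INTO splicing_events "
            "(event_id, search_id, gene_name, event_type) "
            "VALUES (:event_id, :search_id, :gene_name, :event_type)",
            event,
        )
        conn.executemany(
            "INSERT INTO sample_info "
            "(search_id, event_id, srr, mean_psi_per_lsv_junction, majiq) "
            "VALUES (?, ?, ?, ?, 1)",
            [(event["search_id"], event["event_id"], srr, psi) for srr, psi in samples],
        )


# Placeholder values, not real identifiers or accessions.
event = {
    "event_id": "GeneA_chr1:100-200_+_ALTA",
    "search_id": "chr1:100-200",
    "gene_name": "GeneA",
    "event_type": "ALTA",
}
load_event(conn, event, [("SRR_A", 0.25)])
load_event(conn, event, [("SRR_B", 0.75)])  # same event seen in a second sample

n_events = conn.execute("SELECT COUNT(*) FROM splicing_events").fetchone()[0]
n_samples = conn.execute("SELECT COUNT(*) FROM sample_info").fetchone()[0]
print(n_events, n_samples)
```

Using INSERT OR IGNORE keeps the event row unique when the same event is encountered in several samples, while each sample still gets its own row in sample_info.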

Final Database Schema

The script populates two main tables in the SQLite database:

splicing_events Table

This table stores one row for each unique splicing event discovered across all samples.

Column	Description
event_id	A unique identifier for the splicing event (e.g., GeneName_chr:start-end_strand_EventType).
search_id	A simplified identifier used for internal lookups.
gene_name	The name of the gene where the event occurs.
seqid	The chromosome or scaffold ID.
strand	The strand of the gene (+ or -).
start, end	The start and end coordinates of the primary junction.
coord	A simplified coordinate string (seqid:start-end).
event_type	The detailed event type as classified by the parser (e.g., ALTA, EX).
full_coord	The full coordinate string of the event in a format compatible with the UCSC Genome Browser.
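
The identifier patterns listed in the table can be illustrated with a short helper. The function name is hypothetical, and the exact separators are taken only from the examples in the table (seqid:start-end and GeneName_chr:start-end_strand_EventType); the format of search_id and full_coord is not specified there, so they are left out.

```python
def build_identifiers(gene_name, seqid, start, end, strand, event_type):
    """Build coord and event_id strings following the patterns in the table."""
    coord = f"{seqid}:{start}-{end}"
    event_id = f"{gene_name}_{coord}_{strand}_{event_type}"
    return {"coord": coord, "event_id": event_id}


# Placeholder gene and coordinates.
ids = build_identifiers("GeneA", "chr1", 100, 200, "+", "ALTA")
print(ids["coord"])
print(ids["event_id"])
```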

sample_info Table

This table stores the quantitative data for each event on a per-sample basis.

Column	Description
search_id	A foreign key linking to the splicing_events table.
de_novo	Indicates whether the event was discovered de novo by MAJIQ (1) or is based on existing annotation (0).
mean_psi_per_lsv_junction	The mean Percent Spliced-In (PSI) value reported for the event's LSV (local splicing variation) junction in this specific sample.
srr	The SRA accession ID of the sample.
event_id	The unique ID of the event this sample information pertains to.
majiq, sgseq	Flags indicating which tools detected this event. For rows loaded by this parser, majiq is set to 1.
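
To show how the two tables fit together, the sketch below builds both with the columns listed above and retrieves the per-sample PSI values for one gene. The column types are assumptions, all inserted values are placeholders (including search_id and full_coord, whose exact formats are not given here), and the join uses event_id because it is documented as unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE splicing_events (
    event_id TEXT PRIMARY KEY, search_id TEXT, gene_name TEXT, seqid TEXT,
    strand TEXT, "start" INTEGER, "end" INTEGER, coord TEXT,
    event_type TEXT, full_coord TEXT
);
CREATE TABLE sample_info (
    search_id TEXT, de_novo INTEGER, mean_psi_per_lsv_junction REAL,
    srr TEXT, event_id TEXT, majiq INTEGER, sgseq INTEGER
);
""")

event_id = "GeneA_chr1:100-200_+_ALTA"  # placeholder identifier
conn.execute(
    "INSERT INTO splicing_events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (event_id, "chr1:100-200", "GeneA", "chr1", "+", 100, 200,
     "chr1:100-200", "ALTA", "chr1:100-200"),
)
conn.executemany(
    "INSERT INTO sample_info VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("chr1:100-200", 0, 0.25, "SRR_A", event_id, 1, 0),
        ("chr1:100-200", 0, 0.75, "SRR_B", event_id, 1, 0),
    ],
)

# PSI per sample for every MAJIQ-detected event in one gene.
query = """
SELECT e.event_type, s.srr, s.mean_psi_per_lsv_junction
FROM splicing_events AS e
JOIN sample_info AS s ON s.event_id = e.event_id
WHERE e.gene_name = ? AND s.majiq = 1
ORDER BY s.srr
"""
rows = conn.execute(query, ("GeneA",)).fetchall()
for event_type, srr, psi in rows:
    print(event_type, srr, psi)
```

The start and end column names are quoted in the CREATE TABLE statement as a precaution, since END is an SQL keyword.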