2.2 Post‐Processing: MAJIQ Parser - labbces/SpliceScape GitHub Wiki
After MAJIQ and VOILA have identified and quantified splicing events, the majiq_parser.py script performs the final step of the MAJIQ workflow. Its primary purpose is to parse the tab-separated (.tsv) output files from voila modulize, apply a more detailed classification logic to the events, and load the structured data into the final SQLite database.
This creates a unified and easily queryable database of all splicing events, integrating PSI values and sample information.
Inputs and Outputs
| Type | Description |
|---|---|
| Input | The .voila.tsv files generated by the voila modulize process. Each file contains quantified data for a specific event type (e.g., cassette exons, alternative splice sites). |
| Output | The script does not write new files but populates two tables within the project's final SQLite database: splicing_events and sample_info. |
Core Functionality
The script iterates through each splicing event reported by MAJIQ and performs two key actions:
1. Detailed Event Classification
While MAJIQ groups events into broad categories, this script analyzes the specific coordinates of exons and junctions to assign a more precise event type. Based on the strand and whether junction coordinates match known exon boundaries, it classifies events as:
- EX (Exon Skipping): A canonical cassette exon event.
- ALTD (Alternative Donor): An alternative 5' splice site event.
- ALTA (Alternative Acceptor): An alternative 3' splice site event.
- ALTX (Complex Alternative Event): An event involving changes at both the 5' and 3' splice sites.
- IR (Intron Retention): An event where an intron is retained in the mature transcript.
- EP (Expected Path): Represents the constitutively spliced path (no alternative event).
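The boundary-matching idea can be illustrated with a short sketch. This is a simplified illustration, not the logic used in majiq_parser.py: the function name `classify_junction` and all of its parameters are hypothetical, and it only distinguishes EP, ALTD, ALTA, and ALTX. EX and IR are left out because they need additional exon and intron context.

```python
def classify_junction(strand, junction_start, junction_end,
                      left_exon_end, right_exon_start):
    """Toy classifier (hypothetical, not the parser's real code).

    All coordinates are genomic. `left_exon_end` and `right_exon_start`
    are the annotated exon boundaries that the expected junction connects,
    with "left" meaning the lower genomic coordinate.
    """
    left_matches = junction_start == left_exon_end
    right_matches = junction_end == right_exon_start

    # On the + strand the lower-coordinate side is the donor (5' splice site);
    # on the - strand the roles are swapped.
    if strand == "+":
        donor_matches, acceptor_matches = left_matches, right_matches
    elif strand == "-":
        donor_matches, acceptor_matches = right_matches, left_matches
    else:
        raise ValueError(f"unexpected strand: {strand!r}")

    if donor_matches and acceptor_matches:
        return "EP"    # both ends sit on known boundaries
    if acceptor_matches:
        return "ALTD"  # only the donor side differs
    if donor_matches:
        return "ALTA"  # only the acceptor side differs
    return "ALTX"      # both sides differ


print(classify_junction("+", 90, 200, 100, 200))
print(classify_junction("-", 90, 200, 100, 200))
```

The same junction coordinates give a different label on each strand, which is why the strand has to be part of the decision.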
2. Database Loading
The script takes the classified event information, along with the PSI values and sample identifiers, and inserts them as structured rows into the final SQLite database. Event-level attributes go into splicing_events and per-sample measurements go into sample_info, with search_id and event_id linking the two.
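A minimal sketch of the loading step, using Python's built-in `sqlite3` module, might look like this. The column names come from the schema on this page; the column types, the primary key, the use of `INSERT OR IGNORE` to store each event once, and every value in the example rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
conn.executescript("""
CREATE TABLE splicing_events (
    event_id   TEXT PRIMARY KEY,
    search_id  TEXT,
    gene_name  TEXT,
    seqid      TEXT,
    strand     TEXT,
    "start"    INTEGER,
    "end"      INTEGER,   -- quoted because END is an SQL keyword
    coord      TEXT,
    event_type TEXT,
    full_coord TEXT
);
CREATE TABLE sample_info (
    search_id TEXT,
    de_novo   INTEGER,
    mean_psi_per_lsv_junction REAL,
    srr       TEXT,
    event_id  TEXT,
    majiq     INTEGER,
    sgseq     INTEGER
);
""")

# Hypothetical example values; real rows come from the parsed .tsv files.
event = ("GeneA_chr1:1000-2000_+_EX", "GeneA_chr1:1000-2000", "GeneA",
         "chr1", "+", 1000, 2000, "chr1:1000-2000", "EX", "chr1:1000-2000")
sample = ("GeneA_chr1:1000-2000", 0, 0.42, "SRR0000001",
          "GeneA_chr1:1000-2000_+_EX", 1, 0)

# Parameterized statements keep values out of the SQL text.
conn.execute("INSERT OR IGNORE INTO splicing_events VALUES (?,?,?,?,?,?,?,?,?,?)", event)
conn.execute("INSERT INTO sample_info VALUES (?,?,?,?,?,?,?)", sample)
conn.commit()
```

With this layout, a second sample for the same event adds a row to sample_info only, since the event row already exists.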
Final Database Schema
The script populates two main tables in the SQLite database:
splicing_events Table
This table stores unique information about each splicing event discovered across all samples.
| Column | Description |
|---|---|
| event_id | A unique identifier for the splicing event (e.g., GeneName_chr:start-end_strand_EventType). |
| search_id | A simplified identifier used for internal lookups. |
| gene_name | The name of the gene where the event occurs. |
| seqid | The chromosome or scaffold ID. |
| strand | The strand of the gene (+ or -). |
| start, end | The start and end coordinates of the primary junction. |
| coord | A simplified coordinate string (seqid:start-end). |
| event_type | The detailed event type as classified by the parser (e.g., ALTA, EX). |
| full_coord | The full coordinate string of the event in a format compatible with the UCSC Genome Browser. |
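The identifier formats described in the table can be assembled with plain string formatting. The helper below is a hypothetical sketch that follows the patterns shown for coord (seqid:start-end) and event_id (GeneName_chr:start-end_strand_EventType); the function name and the example values are not taken from the script.

```python
def build_identifiers(gene_name, seqid, start, end, strand, event_type):
    """Return (coord, event_id) following the formats in the schema table."""
    coord = f"{seqid}:{start}-{end}"
    event_id = f"{gene_name}_{coord}_{strand}_{event_type}"
    return coord, event_id


coord, event_id = build_identifiers("GeneA", "chr1", 1000, 2000, "+", "EX")
print(coord)
print(event_id)
```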
sample_info Table
This table stores the quantitative data for each event on a per-sample basis.
| Column | Description |
|---|---|
| search_id | A foreign key linking to the splicing_events table. |
| de_novo | Indicates if the event was discovered 'de novo' by MAJIQ (1) or was based on existing annotation (0). |
| mean_psi_per_lsv_junction | The Percent Spliced-In (PSI) value for the event in this specific sample. |
| srr | The SRA accession ID of the sample. |
| event_id | The unique ID of the event this sample information pertains to. |
| majiq, sgseq | Flags indicating which tools detected this event. For events loaded by this parser, majiq is 1. |
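Once both tables are populated, per-sample PSI values for a gene can be retrieved with a join on search_id. The sketch below builds a reduced version of the two tables (only the columns needed for the query) in an in-memory database; the column types and all row values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE splicing_events (
    event_id TEXT, search_id TEXT, gene_name TEXT, event_type TEXT
);
CREATE TABLE sample_info (
    search_id TEXT, event_id TEXT, srr TEXT,
    mean_psi_per_lsv_junction REAL, majiq INTEGER
);
""")

# Hypothetical rows: one event observed in two samples.
conn.execute(
    "INSERT INTO splicing_events VALUES (?,?,?,?)",
    ("GeneA_chr1:1000-2000_+_EX", "GeneA_chr1:1000-2000", "GeneA", "EX"),
)
conn.executemany(
    "INSERT INTO sample_info VALUES (?,?,?,?,?)",
    [
        ("GeneA_chr1:1000-2000", "GeneA_chr1:1000-2000_+_EX", "SRR0000001", 0.42, 1),
        ("GeneA_chr1:1000-2000", "GeneA_chr1:1000-2000_+_EX", "SRR0000002", 0.77, 1),
    ],
)

# PSI per sample for every MAJIQ-detected event in one gene.
rows = conn.execute(
    """
    SELECT e.event_id, e.event_type, s.srr, s.mean_psi_per_lsv_junction
    FROM splicing_events AS e
    JOIN sample_info AS s ON s.search_id = e.search_id
    WHERE e.gene_name = ? AND s.majiq = 1
    ORDER BY s.srr
    """,
    ("GeneA",),
).fetchall()

for row in rows:
    print(row)
```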