2.3 Post‐Processing: SGSeq Parser - labbces/SpliceScape GitHub Wiki
This is the final stage, executed by the merging_outputs.py script. The primary goal of this step is to create a single, unified SQLite database that consolidates the results from all preceding splicing analyses.
The script first defines the final database structure, creating two interconnected tables: splicing_events and sample_info. It then iterates through each sample and calls the appropriate parser (currently majiq_parser) to populate these tables. The result is a clean, relational, and queryable database containing all splicing information generated by the pipeline.
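The per-sample loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in merging_outputs.py: the helper names `read_accessions` and `merge_samples`, the parser call signature `(conn, sample_dir, sra_id, species)`, and the assumption that each sample folder is named after its SRA accession are all hypothetical.

```python
import sqlite3
from pathlib import Path


def read_accessions(list_path):
    """Return the non-empty, stripped lines of the accession list file."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def merge_samples(conn, voila_dir, accessions, species, parser):
    """Call `parser` once per accession and report which samples were found.

    `parser` stands in for majiq_parser; its real signature is not documented
    here, so (conn, sample_dir, sra_id, species) is an assumption.
    """
    processed, missing = [], []
    for sra_id in accessions:
        # Assumed layout: one output folder per accession under voila_dir.
        sample_dir = Path(voila_dir) / sra_id
        if not sample_dir.is_dir():
            missing.append(sra_id)
            continue
        parser(conn, sample_dir, sra_id, species)
        processed.append(sra_id)
    conn.commit()
    return processed, missing
```

Returning the list of missing accessions makes it easy to see which samples in the list had no MAJIQ output, instead of skipping them silently.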
Inputs and Outputs
| Type | Description |
|---|---|
| Input | • The main directory containing the MAJIQ voila output files.<br>• A text file listing all SRA accessions that were processed.<br>• The species name. |
| Output | A single SQLite database file (.db) containing the final, consolidated results. |
Final Database Schema
The script generates the definitive database with the following two tables:
splicing_events Table
This table stores unique information about each splicing event identified across all samples.
| Column | Description |
|---|---|
| event_id | A unique identifier for the specific splicing event. |
| search_id | A simplified identifier used for linking tables. |
| gene_name | The name of the gene where the event occurs. |
| seqid | The chromosome or scaffold ID. |
| strand | The strand of the gene (+ or -). |
| event_type | The type of event (e.g., ALTA, EX, IR). |
| start, end | The start and end coordinates of the event's primary junction. |
| coord | A simplified coordinate string (seqid:start-end). |
| full_coord | The full coordinate string of the event, compatible with genome browsers. |
| species | The name of the species. |
sample_info Table
This table stores the quantitative data for each event on a per-sample basis, linking back to the splicing_events table.
| Column | Description |
|---|---|
| event_id | Foreign key linking to the splicing_events table. |
| de_novo | Indicates if the event was discovered 'de novo' (1) or was from existing annotation (0). |
| mean_psi_majiq | The Percent Spliced-In (PSI) value for the event as calculated by MAJIQ. |
| psi_sgseq | The PSI value for the event as calculated by SGSeq (to be integrated). |
| sra_id | The SRA accession ID of the sample. |
| majiq, sgseq | Flags (1 or 0) indicating which tool detected this event. |
| species | The name of the species. |
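For orientation, the two tables could be declared and queried along the following lines. The column names come from the tables above, but the column types, the primary key on event_id, and the helper names `create_schema` and `events_for_sample` are assumptions made for this sketch; the actual DDL in merging_outputs.py may differ. Note that `end` is an SQL keyword, so the sketch quotes both coordinate columns.

```python
import sqlite3

# Assumed types and constraints; only the column names come from the wiki.
SCHEMA = """
CREATE TABLE IF NOT EXISTS splicing_events (
    event_id   TEXT PRIMARY KEY,
    search_id  TEXT,
    gene_name  TEXT,
    seqid      TEXT,
    strand     TEXT,
    event_type TEXT,
    "start"    INTEGER,
    "end"      INTEGER,
    coord      TEXT,
    full_coord TEXT,
    species    TEXT
);
CREATE TABLE IF NOT EXISTS sample_info (
    event_id       TEXT REFERENCES splicing_events(event_id),
    de_novo        INTEGER,
    mean_psi_majiq REAL,
    psi_sgseq      REAL,
    sra_id         TEXT,
    majiq          INTEGER,
    sgseq          INTEGER,
    species        TEXT
);
"""


def create_schema(conn):
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def events_for_sample(conn, sra_id):
    """Join the two tables and return the events quantified in one sample."""
    return conn.execute(
        """
        SELECT e.gene_name, e.event_type, e.coord, s.mean_psi_majiq
        FROM sample_info AS s
        JOIN splicing_events AS e ON e.event_id = s.event_id
        WHERE s.sra_id = ?
        ORDER BY e.gene_name
        """,
        (sra_id,),
    ).fetchall()
```

A call such as `events_for_sample(conn, "SRR1")`, with a placeholder accession, returns one tuple per event for that sample, or an empty list if the accession is not in the database.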
Script Arguments
| Argument | Function | Required |
|---|---|---|
| --db | Path to the final SQLite database file that will be created. | Yes |
| --voila | Path to the parent directory containing the per-sample MAJIQ output folders. | Yes |
| --spp | The name of the species being processed. | Yes |
| --srr_list | Path to a text file containing the SRA accessions (one per line) to process. | Yes |
| --ref-table | Name of an existing table in the database (e.g., the metadata table) that sra_id should link to. | No |
| --ref-column | Column in the reference table that sra_id links to. Defaults to sra_id. | No |
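A command-line interface matching the table above could be declared with `argparse` roughly as shown here. The flag names, required status, and the `sra_id` default are taken from the table; the function name `build_parser` and the help strings are illustrative, and the real script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Merge splicing results into a single SQLite database."
    )
    parser.add_argument("--db", required=True,
                        help="Path to the SQLite database file to create.")
    parser.add_argument("--voila", required=True,
                        help="Parent directory of the per-sample MAJIQ output folders.")
    parser.add_argument("--spp", required=True,
                        help="Name of the species being processed.")
    parser.add_argument("--srr_list", required=True,
                        help="Text file with one SRA accession per line.")
    parser.add_argument("--ref-table", default=None,
                        help="Optional table to link sra_id to.")
    parser.add_argument("--ref-column", default="sra_id",
                        help="Column in the reference table to link with sra_id.")
    return parser
```

With this declaration, `argparse` exposes the hyphenated flags as `args.ref_table` and `args.ref_column`, while `--srr_list` stays `args.srr_list`.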
Run Example
```
python3 merging_outputs.py --db "final_splicing_database.db" \
  --voila "/path/to/MAJIQ_results/voila/" \
  --spp "Arabidopsis_thaliana" \
  --srr_list "processed_samples.txt"
```