Running SQANTI BUGSI - ConesaLab/SQANTI3 GitHub Wiki
- You pass `--bugsi human` (or `mouse`) on the SQANTI3 command line.
- In `src/qc_pipeline.py`, SQANTI3 sees `args.bugsi` and calls `generate_bugsi_report(bugsi, outputClassPath, args.isoforms)`.
- `generate_bugsi_report()` (in `src/qc_output.py`) builds and executes: `Rscript /…/utilities/report_qc/BUGSI_report.R <classification.txt> <bugsi_<human|mouse>.gtf> <your_transcript.gtf> <utilities_path>`
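The call chain above ends in a plain `Rscript` invocation. As a rough illustration, the argument list could be assembled as in the sketch below. The helper name `build_bugsi_command` and the example file names are hypothetical, and the real `generate_bugsi_report()` may build its arguments differently.

```python
from pathlib import Path


def build_bugsi_command(classification, bugsi_gtf, transcript_gtf, utilities_path):
    """Return the Rscript argument list (hypothetical helper, not SQANTI3 code)."""
    # The script location under the utilities directory follows the path shown above.
    script = Path(utilities_path) / "report_qc" / "BUGSI_report.R"
    return [
        "Rscript",
        str(script),
        str(classification),   # SQANTI3 classification table
        str(bugsi_gtf),        # gold-standard BUGSI GTF for the chosen species
        str(transcript_gtf),   # user transcript GTF
        str(utilities_path),   # passed through so the R script can find its assets
    ]


# Example file names are made up for illustration.
cmd = build_bugsi_command(
    "sample_classification.txt", "bugsi_human.gtf", "sample.gtf", "utilities"
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.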
- Inputs:
  - `classification.txt`: SQANTI3 classification of each isoform
  - `bugsi_<species>.gtf`: gold-standard GTF of known BUGSI genes (with `ensembl`/`refseq`/`gene_name` fields)
  - `transcript.gtf`: your full transcript GTF
  - path to the utilities directory
- Load libraries: ggplot2, dplyr, rtracklayer, Gviz, rmarkdown, etc.
- Import data
  - `classification_data` ← read SQANTI3 TSV
  - `bugsi_gtf` ← `rtracklayer::import()` → extract gene-level table
  - `transcript_gtf` ← `rtracklayer::import()`
- ID-type classification
  - Classify each `associated_gene` as "ensembl", "refseq", "gene_name", or "unknown" via regex
  - Choose the dominant ID type; if Ensembl, strip version suffixes from `gene_id` in transcripts
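The ID-type step can be pictured with a short Python sketch. The regular expressions below are assumptions chosen for illustration (the patterns actually used in `BUGSI_report.R` are not shown on this page), and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed patterns, for illustration only.
ENSEMBL_RE = re.compile(r"^ENS[A-Z]*G\d+(\.\d+)?$")        # e.g. ENSG00000141510.17
REFSEQ_RE = re.compile(r"^[NX][MR]_\d+(\.\d+)?$")           # e.g. NM_000546.6
GENE_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9.\-]*$")     # e.g. TP53


def classify_id(gene_id):
    """Label one identifier as ensembl, refseq, gene_name, or unknown."""
    if ENSEMBL_RE.match(gene_id):
        return "ensembl"
    if REFSEQ_RE.match(gene_id):
        return "refseq"
    if GENE_NAME_RE.match(gene_id):
        return "gene_name"
    return "unknown"


def dominant_id_type(gene_ids):
    """Return the most frequent ID type across all identifiers."""
    counts = Counter(classify_id(g) for g in gene_ids)
    return counts.most_common(1)[0][0] if counts else "unknown"


def strip_version(gene_id):
    """Remove a trailing '.<digits>' version suffix."""
    return re.sub(r"\.\d+$", "", gene_id)


ids = ["ENSG00000141510.17", "ENSG00000012048.23", "TP53"]
print(dominant_id_type(ids), [strip_version(i) for i in ids])
```

Checking the Ensembl pattern first matters because an Ensembl ID would otherwise also satisfy the looser gene-name pattern.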
- Clean & explode fusion records
  - Drop fusion records, then re-add them with split `associated_gene` lists
  - Strip transcript/gene version suffixes and dedupe
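A minimal sketch of the fusion clean-up, written in Python rather than R. It assumes that fusion rows list their genes in `associated_gene` joined by `_`; that separator, the function name, and the example IDs are assumptions. Only the gene column is version-stripped here, so isoform IDs such as `PB.1.1` are left untouched.

```python
import re

VERSION_RE = re.compile(r"\.\d+$")  # trailing ".<digits>" version suffix


def explode_fusions(records, sep="_"):
    """Split fusion rows into one row per gene, strip gene versions, dedupe."""
    # Keep non-fusion rows as they are, then re-add fusions one gene at a time.
    rows = [r for r in records if r["structural_category"] != "fusion"]
    for rec in records:
        if rec["structural_category"] == "fusion":
            for gene in rec["associated_gene"].split(sep):
                rows.append({**rec, "associated_gene": gene})

    seen, out = set(), []
    for rec in rows:
        gene = VERSION_RE.sub("", rec["associated_gene"])
        key = (rec["isoform"], gene)
        if key not in seen:  # drop exact isoform/gene duplicates
            seen.add(key)
            out.append({**rec, "associated_gene": gene})
    return out


# Made-up identifiers for illustration.
rows = [
    {"isoform": "PB.1.1", "structural_category": "fusion",
     "associated_gene": "ENSG00000000001.4_ENSG00000000002.7"},
    {"isoform": "PB.2.1", "structural_category": "full-splice_match",
     "associated_gene": "ENSG00000000003.2"},
]
for rec in explode_fusions(rows):
    print(rec["isoform"], rec["associated_gene"])
```

A fixed `_` separator would mis-split identifiers that themselves contain underscores, which is why it is exposed as a parameter in this sketch.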
- Define benchmarking sets
  - BUGSI_transcripts: isoforms whose `associated_gene` is in the gold list
  - TP (True Positives): subset of BUGSI_transcripts with `subcategory == "reference_match"`
  - PTP (Partial TP): full-splice match (FSM) or incomplete-splice match (ISM) isoforms that are not reference matches (RM)
  - FP (False Positives): novel categories (NIC, NNC, genic, fusion)
  - FN (False Negatives): gold-standard genes with no FSM/ISM hit
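The set definitions above can be sketched as a small labelling function. This is an illustrative Python version, not the R code in the report script; the function names are hypothetical, and the category strings follow the standard SQANTI3 `structural_category` values.

```python
MATCH_CATEGORIES = {"full-splice_match", "incomplete-splice_match"}
NOVEL_CATEGORIES = {"novel_in_catalog", "novel_not_in_catalog", "genic", "fusion"}


def label_isoform(rec, gold_genes):
    """Return TP, PTP, FP, or None for one classification row."""
    if rec["associated_gene"] not in gold_genes:
        return None                      # not a BUGSI transcript
    if rec["subcategory"] == "reference_match":
        return "TP"
    if rec["structural_category"] in MATCH_CATEGORIES:
        return "PTP"                     # FSM/ISM but not a reference match
    if rec["structural_category"] in NOVEL_CATEGORIES:
        return "FP"
    return None                          # other categories are left unlabelled


def missing_genes(records, gold_genes):
    """FN: gold-standard genes with no FSM/ISM isoform."""
    hit = {r["associated_gene"] for r in records
           if r["structural_category"] in MATCH_CATEGORIES}
    return set(gold_genes) - hit


gold = {"GENE_A", "GENE_B", "GENE_C"}    # made-up gene names
records = [
    {"associated_gene": "GENE_A", "structural_category": "full-splice_match",
     "subcategory": "reference_match"},
    {"associated_gene": "GENE_A", "structural_category": "incomplete-splice_match",
     "subcategory": "3prime_fragment"},
    {"associated_gene": "GENE_B", "structural_category": "novel_in_catalog",
     "subcategory": "combination_of_known_junctions"},
]
print([label_isoform(r, gold) for r in records], missing_genes(records, gold))
```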
- Compute metrics
  - Sensitivity = # unique TP genes / # gold-standard genes
  - Non-redundant Precision = TP / total BUGSI_transcripts
  - Redundant Precision = (TP + PTP) / total BUGSI_transcripts
  - Positive Detection Rate = # unique (TP+PTP) genes / # gold-standard genes
  - False Discovery Rate = (total BUGSI_transcripts - TP) / total BUGSI_transcripts
  - False Detection Rate = FP / total BUGSI_transcripts
  - Redundancy = (FSM + ISM) / # unique (TP+PTP) genes
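To make the formulas concrete, here is a small worked example in Python. It takes (gene, label) pairs for the BUGSI transcripts and the number of gold-standard genes. The function name is hypothetical, and the sketch assumes that every FSM/ISM isoform carries a TP or PTP label, so FSM + ISM is counted as TP + PTP.

```python
def bugsi_metrics(labelled, n_gold):
    """Compute the metrics listed above from (gene, label) pairs."""
    total = len(labelled)
    tp = sum(1 for _, lab in labelled if lab == "TP")
    ptp = sum(1 for _, lab in labelled if lab == "PTP")
    fp = sum(1 for _, lab in labelled if lab == "FP")
    tp_genes = {g for g, lab in labelled if lab == "TP"}
    hit_genes = {g for g, lab in labelled if lab in ("TP", "PTP")}

    def ratio(a, b):
        # Avoid ZeroDivisionError on empty inputs.
        return a / b if b else float("nan")

    return {
        "sensitivity": ratio(len(tp_genes), n_gold),
        "non_redundant_precision": ratio(tp, total),
        "redundant_precision": ratio(tp + ptp, total),
        "positive_detection_rate": ratio(len(hit_genes), n_gold),
        "false_discovery_rate": ratio(total - tp, total),
        "false_detection_rate": ratio(fp, total),
        "redundancy": ratio(tp + ptp, len(hit_genes)),  # assumes FSM + ISM == TP + PTP
    }


# Made-up example: 4 gold-standard genes, 5 BUGSI transcripts.
labelled = [("A", "TP"), ("A", "PTP"), ("B", "PTP"), ("B", "PTP"), ("C", "FP")]
metrics = bugsi_metrics(labelled, n_gold=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy input, gene A is recovered exactly, gene B only partially, gene C only as a novel isoform, and the fourth gold-standard gene is not detected at all.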
- Render report
  - Tabulate and round percentages
  - Assign each isoform to "TP", "PTP", "FP", or "Missing" (for FN)
  - Call `rmarkdown::render()` on `SQANTI3_BUGSI_Report.Rmd`
- `bugsi_style.css` and `bugsi_script.js` accompany the Rmd to style the interactive report.
- Output: `<your_prefix>_BUGSI_report.html` in your output directory, containing summary tables, bar/pie charts of TP/PTP/FP/FN, and interactive drill-downs.
In short: BUGSI cross-references your SQANTI3 classification against a curated GTF of known single-isoform genes, sorts isoforms into TP/PTP/FP/FN, computes the metrics listed above, and wraps everything in a self-contained RMarkdown HTML report.
- Retrieved GTFs from MANE Select (Human), GENCODE (Human/Mouse), and NCBI RefSeq (Human/Mouse).
- Cross-validation: kept only genes with a single, perfectly matching isoform across all sources (splice junctions, TSS, TTS).
- Initial candidates: 1,925 human genes; 2,345 mouse genes.
- Quantified median expression using GTEx (Human) and ENCODE (Mouse) RNA‑seq.
- Tissue-specific sets: ≥ 5 TPM in at least one tissue.
- Universal set: ≥ 1 TPM across every evaluated tissue.
- Integrated housekeeping genes from HRT Atlas v1.0.
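The two expression filters differ only in the threshold and in whether one tissue or every tissue must pass. A minimal Python sketch, assuming a per-gene mapping of tissue name to median TPM (the function names and example values are made up):

```python
def is_tissue_specific(median_tpm, min_tpm=5.0):
    """True if the gene reaches min_tpm in at least one tissue."""
    return any(v >= min_tpm for v in median_tpm.values())


def is_universal(median_tpm, min_tpm=1.0):
    """True if the gene reaches min_tpm in every evaluated tissue."""
    return bool(median_tpm) and all(v >= min_tpm for v in median_tpm.values())


# Made-up median TPM values per tissue.
gene_a = {"liver": 12.0, "brain": 0.3, "heart": 0.8}
gene_b = {"liver": 2.5, "brain": 1.1, "heart": 4.0}

print(is_tissue_specific(gene_a), is_universal(gene_a))
print(is_tissue_specific(gene_b), is_universal(gene_b))
```

Here `gene_a` qualifies for a tissue-specific set only, while `gene_b` qualifies for the universal set only, which shows that the two sets are not nested.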
- Multi-exon genes:
  - Extracted annotated junction coverages from Recount3.
  - Computed the mean annotated-junction coverage μ = (Σ Cᵢ) / n.
  - Set the threshold T = α × μ (α = 0.01).
  - Excluded any gene with novel junction coverage Cₙₒᵥₑₗ > T.
  - Used IntroVerse (Human) to remove genes with novel junctions in > 50% of GTEx samples per tissue.
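The coverage threshold for multi-exon genes is simple arithmetic, sketched below in Python. The function name and the coverage values are made up, and the sketch assumes the threshold is checked against each novel junction individually, which the description above does not state explicitly.

```python
def passes_junction_filter(annotated_cov, novel_cov, alpha=0.01):
    """Keep a gene only if no novel junction exceeds T = alpha * mean(annotated)."""
    mu = sum(annotated_cov) / len(annotated_cov)   # μ = (Σ Cᵢ) / n
    threshold = alpha * mu                         # T = α × μ
    return all(c <= threshold for c in novel_cov)


annotated = [1000, 1200, 800]   # μ = 1000, so T = 10 with α = 0.01
print(passes_junction_filter(annotated, [3, 9]))    # every novel junction is below T
print(passes_junction_filter(annotated, [3, 15]))   # one novel junction is above T
```

With α = 0.01, a novel junction has to reach more than 1% of the mean annotated coverage before the gene is excluded.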
- Single-exon genes:
  - Overlapped coordinates with refTSS; excluded any with alternative TSS evidence.
- Collaborated with GENCODE annotation experts to verify no plausible alternative isoforms.
- Human: 53 BUGSI genes
- Mouse: 37 BUGSI genes
- Tissue‑specific BUGSI gene lists are available at the BUGSI portal (https://bugsi.uv.es).