TUSCO novel - ConesaLab/SQANTI3 GitHub Wiki
TUSCO‑novel benchmarks a pipeline’s ability to detect truly novel isoforms by intentionally “lying” to the pipeline: the reference GTF is altered only for curated single‑isoform TUSCO genes so that the real expressed isoform appears novel relative to the supplied annotation.
Note: The curated TUSCO gene list is derived from the TSV panel (src/utilities/report_qc/tusco_<species>.tsv). TUSCO itself uses TSV panels only; no TUSCO GTF is required.
- Alters single‑isoform, multi‑exon TUSCO genes to hide one real internal junction and replace it with a plausible synthetic junction (canonical GT–AG; reasonable intron length; still multi‑exon).
- Tools that depend on the given annotation tend to degrade; reference‑agnostic discovery plus robust filtering (e.g., Iso‑Seq + SQANTI3 ML) better controls false positives.
Design constraints (for controlled difficulty and fairness):
- Modify only internal splice junctions; preserve TSS/TTS and multi‑exon structure.
- Use canonical motifs (GT–AG) and intron lengths within empirical bounds for the species/tissue.
- Leave all non‑TUSCO genes unchanged to localize the perturbation.
- Inputs: native reference GTF, genome FASTA, curated TUSCO single‑isoform gene list (human/mouse).
- For each multi‑exon TUSCO gene:
  - Remove one true internal splice junction in the transcript.
  - Insert a plausible synthetic junction (canonical; reasonable distance; preserves multi‑exon structure).
- Use this altered GTF as the only “reference” for reconstruction and for downstream classification/evaluation.
Simulator code: https://github.com/TianYuan-Liu/tusco-paper/blob/main/src/tusco_novel_simulator/tusco_novel_sim.py
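The linked simulator is the authoritative implementation. As a rough illustration only, the junction swap described above could be sketched as follows. This is not the simulator's code: the function names, the 0-based half-open coordinates, and the `window`, `min_exon`, `min_intron`, and `max_intron` defaults are all assumptions made for this example. It moves one intron's left boundary to a different nearby site that still gives a canonical GT-AG intron, leaves the transcript start and end untouched, and returns a log entry for the change.

```python
import random


def is_canonical(genome, intron_start, intron_end, strand):
    """True if the intron [intron_start, intron_end) is GT-AG in transcript sense."""
    left = genome[intron_start:intron_start + 2].upper()
    right = genome[intron_end - 2:intron_end].upper()
    if strand == "+":
        return (left, right) == ("GT", "AG")
    # On the minus strand a GT-AG intron reads CT...AC on the forward strand.
    return (left, right) == ("CT", "AC")


def replace_junction(exons, genome, strand, rng, window=50,
                     min_exon=20, min_intron=30, max_intron=100_000):
    """Shift one intron's left boundary to another canonical site nearby.

    exons: list of (start, end) sorted by position, 0-based half-open.
    Returns (new_exons, log_entry), or None when no alternative site exists.
    All thresholds are illustrative defaults, not the simulator's values.
    """
    if len(exons) < 2:
        return None  # single-exon transcripts are out of scope
    order = list(range(len(exons) - 1))
    rng.shuffle(order)  # seeded RNG keeps the choice reproducible
    for i in order:
        (exon_start, old_left), (old_right, _) = exons[i], exons[i + 1]
        candidates = []
        for new_left in range(old_left - window, old_left + window + 1):
            if new_left == old_left:
                continue  # must differ from the real junction
            if new_left - exon_start < min_exon:
                continue  # keep the upstream exon a reasonable size
            if not min_intron <= old_right - new_left <= max_intron:
                continue  # keep the intron length plausible
            if is_canonical(genome, new_left, old_right, strand):
                candidates.append(new_left)
        if candidates:
            new_left = rng.choice(candidates)
            new_exons = list(exons)
            new_exons[i] = (exon_start, new_left)
            log_entry = {"intron_index": i,
                         "old": (old_left, old_right),
                         "new": (new_left, old_right)}
            return new_exons, log_entry
    return None


# Toy sequence (not real data): true donor at 40, alternative donor at 30.
toy = list("A" * 140)
toy[30:32] = "GT"
toy[40:42] = "GT"
toy[98:100] = "AG"
toy_genome = "".join(toy)
print(replace_junction([(0, 40), (100, 140)], toy_genome, "+", random.Random(42)))
```

Passing a seeded `random.Random` instance rather than using the module-level RNG keeps the choice of junction reproducible, and the returned log entry is what a per-gene change log would store.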
- Quantify dependence on annotation vs sequencing data.
- Assess novel discovery under under‑annotated or misleading references (species/tissues).
- Genome FASTA: hg38.fa or mm10.fa.
- Native reference GTF: e.g., GENCODE.
- TUSCO gene list: derive from src/utilities/report_qc/tusco_<species>.tsv (human or mouse).
  - Extract the TUSCO gene identifiers to a text file, one per line (e.g., tusco_genes.txt).
- Your reads/alignments as required by each pipeline (BAMs or CCS/FASTQ for Iso‑Seq).
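One way to produce the one-identifier-per-line file from the panel TSV is sketched below. The column layout of the panel is an assumption here: the helper (a hypothetical name, `write_gene_list`) takes the identifier from the column index you pass in and can skip a header row, so inspect the TSV first and set `column` and `has_header` to match it.

```python
import csv


def write_gene_list(tsv_path, out_path, column=0, has_header=False):
    """Write the unique values of one TSV column, one per line, in file order.

    `column` and `has_header` are assumptions about the panel layout;
    check the actual TSV before relying on the defaults.
    """
    seen = set()
    genes = []
    with open(tsv_path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        header_pending = has_header
        for row in rows:
            if not row or row[0].startswith("#"):
                continue  # skip blank and comment lines
            if header_pending:
                header_pending = False
                continue  # skip the first non-comment row
            if len(row) <= column:
                continue  # row too short to hold the identifier
            gene_id = row[column].strip()
            if gene_id and gene_id not in seen:
                seen.add(gene_id)
                genes.append(gene_id)
    with open(out_path, "w") as out:
        for gene_id in genes:
            out.write(gene_id + "\n")
    return genes
```

For example, `write_gene_list("tusco_human.tsv", "tusco_genes.txt")` would write the file passed to `--tusco-list`, provided the identifiers really are in the first column.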
Use the simulator to modify only TUSCO genes:
python tusco_novel_sim.py \
--refGTF native.gtf \
--genome hg38.fa \
--tusco-list tusco_genes.txt \
--out tusco_novel.gtf \
--seed 42
Sanity checks:
- Modify only TUSCO single‑isoform genes and only internal junctions.
- Ensure synthetic junctions are canonical and yield valid multi‑exon transcripts.
- Log which junction was modified per gene for reproducibility.
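The first check can be done independently of the simulator by comparing exon records between the native and altered GTFs. The sketch below is an assumption-laden helper, not part of SQANTI3 or the simulator: it reads only `exon` lines, keys genes by `gene_id` with any version suffix stripped (an assumption, so that ENSG IDs with and without `.N` match), and reports genes whose exon set differs but which are not on the TUSCO list.

```python
import re
from collections import defaultdict

_ATTR = re.compile(r'(\S+) "([^"]*)"')


def exons_by_gene(gtf_path):
    """Map gene_id (version stripped) to its set of exon records."""
    genes = defaultdict(set)
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != "exon":
                continue
            attrs = dict(_ATTR.findall(fields[8]))
            gene_id = attrs.get("gene_id", "").split(".")[0]
            genes[gene_id].add((attrs.get("transcript_id"), fields[0],
                                int(fields[3]), int(fields[4]), fields[6]))
    return genes


def changed_genes(native_gtf, altered_gtf):
    """Genes whose exon records differ between the two annotations."""
    native = exons_by_gene(native_gtf)
    altered = exons_by_gene(altered_gtf)
    return {g for g in native.keys() | altered.keys()
            if native.get(g) != altered.get(g)}


def unexpected_changes(native_gtf, altered_gtf, tusco_ids):
    """Changed genes that are NOT on the TUSCO list (should be empty)."""
    tusco = {g.split(".")[0] for g in tusco_ids}
    return changed_genes(native_gtf, altered_gtf) - tusco
```

An empty result from `unexpected_changes("native.gtf", "tusco_novel.gtf", ids)` indicates that the perturbation stayed within the TUSCO genes. It does not show that each change is an internal junction; that still has to be checked per gene, for instance against the simulator log.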
Provenance and determinism:
- Record the commit of the simulator and configuration used.
- Fix the RNG seed (e.g., --seed 42) and preserve the simulator log.
- StringTie2:
  stringtie aligned.bam -G tusco_novel.gtf -o stringtie.gtf -L
- FLAIR (guide with altered annotation):
  flair correct -q reads.fastq -g hg38.fa -f tusco_novel.gtf -o flair_correct
  flair collapse -g hg38.fa -r reads.fastq -q flair_correct.bed -f tusco_novel.gtf -o flair_collapse
- Bambu (R): provide tusco_novel.gtf as the annotation object.
- Iso‑Seq + SQANTI3 ML: run Iso‑Seq reference‑free; use tusco_novel.gtf only for SQANTI3 classification.
- TUSCO‑novel classification:
  python sqanti3_qc.py \
    --isoforms <tool_output.gtf> \
    --refGTF tusco_novel.gtf \
    --refFasta hg38.fa \
    --tusco human|mouse \
    --report html -o <prefix_novel> -d <outdir>
- Baseline (native reference):
  python sqanti3_qc.py \
    --isoforms <tool_output.gtf> \
    --refGTF native.gtf \
    --refFasta hg38.fa \
    --tusco human|mouse \
    --report html -o <prefix_native> -d <outdir>
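Because the two classification runs should differ only in the reference GTF and the output prefix, it can help to generate both command lines from a single template. The sketch below uses only the flags shown in the commands above; the helper names and the default file names and prefixes are assumptions for illustration.

```python
def sqanti3_qc_command(isoforms, ref_gtf, ref_fasta, species, prefix, outdir):
    """Build the sqanti3_qc.py argument list used on this page."""
    return ["python", "sqanti3_qc.py",
            "--isoforms", isoforms,
            "--refGTF", ref_gtf,
            "--refFasta", ref_fasta,
            "--tusco", species,
            "--report", "html",
            "-o", prefix,
            "-d", outdir]


def paired_commands(isoforms, ref_fasta, species, outdir,
                    native_gtf="native.gtf", novel_gtf="tusco_novel.gtf",
                    native_prefix="prefix_native", novel_prefix="prefix_novel"):
    """Return native and TUSCO-novel commands that share every other argument."""
    return {
        "native": sqanti3_qc_command(isoforms, native_gtf, ref_fasta,
                                     species, native_prefix, outdir),
        "novel": sqanti3_qc_command(isoforms, novel_gtf, ref_fasta,
                                    species, novel_prefix, outdir),
    }
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Building the lists in one place makes it harder for a parameter to drift between the two runs and gives an exact record of both command lines.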
Outputs are identical to the standard TUSCO report. See TUSCO Quick Start for report paths, metric definitions, and interpretation guidance.
- Expect larger drops under TUSCO‑novel for reference‑guided tools (e.g., StringTie2, Bambu) if they rely on annotation for novel splice discovery.
- Reference‑free discovery with stringent filtering (Iso‑Seq + SQANTI3 ML) typically:
  - Minimizes false positives (FDR), especially when junction support is enforced.
  - May show lower sensitivity/precision for exact TSS/TTS due to read length and end adjustments.
- Compare native vs TUSCO‑novel: the gap indicates reliance on annotation vs data.
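The native versus TUSCO‑novel comparison amounts to a per-metric difference. A minimal sketch, assuming the metric values have already been read from the two reports into plain dictionaries (how they are extracted is not covered here, and the numbers below are invented placeholders, not results):

```python
def annotation_gap(native, novel):
    """Native minus TUSCO-novel for every metric present in both runs."""
    return {name: native[name] - novel[name]
            for name in native if name in novel}


# Placeholder values for illustration only; they are not measurements.
native_metrics = {"Sn": 0.75, "1-FDR": 1.0}
novel_metrics = {"Sn": 0.5, "1-FDR": 0.75}
for name, gap in annotation_gap(native_metrics, novel_metrics).items():
    print(name, gap)
```

A larger positive gap for a tool suggests it leaned more heavily on the annotation. The comparison is only meaningful when both runs used identical pipeline parameters.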
Report context:
- TUSCO‑novel uses the same TSV panel and metrics as TUSCO. Interpreting Sn, nrPre, rPre, 1−FDR, PDR, and 1/red follows the same guidance as in the Quick Start.
- Restrict modification to multi‑exon TUSCO genes with well‑supported junctions; leave the rest of the annotation unchanged.
- Use a fixed RNG seed and write a per‑gene change log.
- Verify synthetic junctions against the genome (canonical motifs, reasonable intron sizes).
- Keep pipeline parameters identical between native and TUSCO‑novel runs for fair comparison.
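For the genome verification step in the list above, the splice motif and intron length of a junction can be read from the FASTA using the GTF exon coordinates. The sketch below is an assumption-based helper rather than SQANTI3 or simulator code: it loads the whole FASTA into memory (fine for a toy file, while an indexed reader would suit hg38 better), takes exons as 1-based inclusive GTF coordinates, and uses placeholder intron-size bounds that should be replaced with empirical values for the species and tissue.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]


def read_fasta(path):
    """Load a FASTA file into {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def check_junction(seq, exon_a, exon_b, strand,
                   min_intron=30, max_intron=100_000):
    """Check the intron between two exons adjacent in genomic order.

    exon_a, exon_b: (start, end) in 1-based inclusive GTF coordinates,
    with exon_a to the left of exon_b. The bounds are placeholders.
    """
    intron_start = exon_a[1]       # 0-based first intron base
    intron_end = exon_b[0] - 1     # 0-based, exclusive
    left = seq[intron_start:intron_start + 2].upper()
    right = seq[intron_end - 2:intron_end].upper()
    if strand == "+":
        donor, acceptor = left, right
    else:
        donor, acceptor = reverse_complement(right), reverse_complement(left)
    length = intron_end - intron_start
    return {"donor": donor, "acceptor": acceptor, "length": length,
            "canonical": (donor, acceptor) == ("GT", "AG"),
            "length_ok": min_intron <= length <= max_intron}
```

Running `check_junction` over every consecutive exon pair of each modified transcript, and failing if any result has `canonical` or `length_ok` set to false, gives an independent check on the altered GTF.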
Limitations:
- The stress test focuses on splice‑junction novelty; it does not simulate alternative TSS/TTS or structural variants.
- Results can depend on the choice of TUSCO panel (species/tissue) and the simulator constraints; report both explicitly.
- SQANTI3 version/commit and command lines for both native and TUSCO‑novel runs.
- Provenance of the TUSCO panel TSV (filename, species, checksum).
- Simulator commit, configuration, RNG seed, and logs.
- All generated logs (including tusco_report.log) and HTML reports archived.
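The items above can be collected into a single machine-readable manifest stored next to the archived reports. The sketch below is one possible layout, not a format defined by SQANTI3 or the simulator: the field names and the helper names `sha256_of` and `build_manifest` are assumptions, and the commit and version strings are supplied by the caller.

```python
import hashlib
import json
import os


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(panel_tsv, species, simulator_commit, sqanti3_version,
                   seed, commands):
    """Collect provenance details; every field name here is illustrative."""
    return {
        "tusco_panel": {
            "filename": os.path.basename(panel_tsv),
            "species": species,
            "sha256": sha256_of(panel_tsv),
        },
        "simulator": {"commit": simulator_commit, "seed": seed},
        "sqanti3_version": sqanti3_version,
        "commands": commands,  # e.g. {"native": [...], "novel": [...]}
    }


def write_manifest(manifest, out_path):
    with open(out_path, "w") as out:
        json.dump(manifest, out, indent=2, sort_keys=True)
```

Storing the checksum alongside the filename lets a reader confirm later that the same panel TSV was used for both runs.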