TUSCO quick start (SQANTI3 QC) - ConesaLab/SQANTI3 GitHub Wiki
TUSCO (Transcriptome Universal Single-isoform COntrol) is a curated internal reference set of genes lacking alternative isoforms, designed to benchmark long-read transcriptome sequencing quality without external spike-in controls.
Unlike BUSCO (which can misinterpret alternative splicing as gene duplications) or spike-ins like SIRVs/ERCCs (which oversimplify real-sample complexity and neglect RNA degradation artifacts), TUSCO uses endogenous single-isoform genes to evaluate:
- Precision: Identifying transcripts that deviate from reference annotations
- Sensitivity: Verifying detection completeness of known transcripts
This module is integrated into SQANTI3 QC and generates an interactive HTML report with benchmarking metrics and IGV-style genome visualization plots.
The data/tusco/ directory contains a minimal dataset for testing SQANTI3 TUSCO benchmarking (~8.8 MB total):
| File | Size | Description |
|---|---|---|
| `tusco_genome.fa` | 4.7 MB | Genomic regions around 46 TUSCO genes |
| `tusco_genome.fa.fai` | 2 KB | FASTA index |
| `tusco_annotation.gtf` | 3.7 MB | GENCODE v49 annotation subset (8,688 features) |
| `tusco_input.gtf` | 13 KB | Example input transcripts (112 entries, 40 genes) |
Navigate to the SQANTI3 directory and create the conda environment:
cd /path/to/SQANTI3
conda env create -f SQANTI3.conda_env.yml
This installs 70+ packages, including Python 3.11, R 4.3+, bioinformatics tools (minimap2, samtools, bedtools), and R packages (ggplot2, plotly, Gviz, rmarkdown).
Tip: For faster installation, use mamba instead of conda:
mamba env create -f SQANTI3.conda_env.yml
Activate the environment:
conda activate sqanti3
Run SQANTI3 with TUSCO benchmarking enabled:
python sqanti3_qc.py \
--isoforms data/tusco/tusco_input.gtf \
--refGTF data/tusco/tusco_annotation.gtf \
--refFasta data/tusco/tusco_genome.fa \
--tusco human \
-d output/tusco_example \
--skipORF
Parameters:
- `--isoforms`: Input transcript GTF file to benchmark
- `--refGTF`: Reference annotation GTF
- `--refFasta`: Reference genome FASTA
- `--tusco human`: Enable TUSCO benchmarking (use `human` or `mouse`)
- `-d`: Output directory
- `--skipORF`: Skip ORF prediction (faster for testing)
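For scripted runs, the command above can be assembled programmatically. The following is a minimal sketch, not part of SQANTI3: `build_tusco_command` and `missing_inputs` are hypothetical helper names, and only the flags listed above are used.

```python
from pathlib import Path


def build_tusco_command(isoforms, ref_gtf, ref_fasta, species, out_dir, skip_orf=True):
    """Assemble the sqanti3_qc.py argument list from the flags documented above.

    Hypothetical helper for illustration; it does not call SQANTI3 itself.
    """
    if species not in ("human", "mouse"):
        raise ValueError(f"--tusco expects 'human' or 'mouse', got {species!r}")
    cmd = [
        "python", "sqanti3_qc.py",
        "--isoforms", str(isoforms),
        "--refGTF", str(ref_gtf),
        "--refFasta", str(ref_fasta),
        "--tusco", species,
        "-d", str(out_dir),
    ]
    if skip_orf:
        cmd.append("--skipORF")
    return cmd


def missing_inputs(*paths):
    """Return the input paths that do not exist as regular files."""
    return [str(p) for p in paths if not Path(p).is_file()]
```

A possible use is to call `missing_inputs(...)` on the three input files first and, if the returned list is empty, pass the result of `build_tusco_command(...)` to `subprocess.run(cmd, check=True)` from the SQANTI3 directory with the environment activated.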
The TUSCO module generates the following output files:
| File | Description |
|---|---|
| `<prefix>_TUSCO_report.html` | Interactive HTML benchmarking report with metrics and visualizations |
| `<prefix>_TUSCO_results.tsv` | Transcript categorization (`transcript_id`, `associated_gene`, `structural_category`, `subcategory`, `TUSCO_category`) |
| `igv_plots/` | Directory with IGV-style genome visualization PNG plots (one per gene) |
| `logs/tusco_report.log` | Execution log for troubleshooting |
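The results table can be summarized outside the HTML report. The sketch below tallies transcripts per TUSCO category. It assumes the TSV is tab-separated with a header row that includes the `TUSCO_category` column named above. `count_tusco_categories` is a hypothetical helper, not a SQANTI3 function.

```python
import csv
from collections import Counter


def count_tusco_categories(tsv_path):
    """Count rows per TUSCO_category in a <prefix>_TUSCO_results.tsv file.

    Assumes a tab-separated file whose header contains 'TUSCO_category'.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["TUSCO_category"] for row in reader)
```

The returned `Counter` maps each category label found in the file to the number of rows carrying it. Categories that never occur are simply absent from the result.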
Standard SQANTI3 outputs are also generated:
- `<prefix>_classification.txt`: Full SQANTI3 classification results
- `<prefix>_corrected.gtf`: Corrected GTF file
- `<prefix>_junctions.txt`: Junction information
To view the report, open <prefix>_TUSCO_report.html in a web browser.
SQANTI3 includes pre-built TUSCO reference panels:
| Species | File | Genes | Location |
|---|---|---|---|
| Human | `tusco_human.tsv` | 46 | `src/utilities/report_qc/` |
| Mouse | `tusco_mouse.tsv` | 33 | `src/utilities/report_qc/` |
Each TSV file contains columns: Ensembl Gene ID, Ensembl Transcript ID, Gene Symbol, Entrez ID, RefSeq mRNA, RefSeq Protein.
The TUSCO report calculates 7 benchmarking metrics:
| Metric | Description |
|---|---|
| Sensitivity | Proportion of reference transcripts correctly detected (TP / (TP + FN)) |
| Non-redundant Precision | Proportion of unique predicted transcripts that are correct (TP / (TP + FP)) |
| Redundant Precision | Precision accounting for redundant predictions |
| Positive Detection Rate | Rate of true positive detections among all predictions |
| False Discovery Rate | Proportion of predictions that are false positives (FP / (TP + FP)) |
| False Detection Rate | Rate of reference transcripts not detected (FN / (TP + FN)) |
| Redundancy | Ratio of total predictions to unique predictions |
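As a worked illustration, the four metrics whose formulas appear in the table can be computed directly from TP, FP, and FN counts. This is a sketch based only on those formulas. How the report counts partial true positives is not covered here, and the three metrics given without a formula are omitted. `tusco_metrics` is a hypothetical name.

```python
def tusco_metrics(tp, fp, fn):
    """Compute the metrics whose formulas are stated in the table above.

    tp, fp, fn are non-negative counts. A metric is NaN when its
    denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),              # TP / (TP + FN)
        "non_redundant_precision": ratio(tp, tp + fp),  # TP / (TP + FP)
        "false_discovery_rate": ratio(fp, tp + fp),     # FP / (TP + FP)
        "false_detection_rate": ratio(fn, tp + fn),     # FN / (TP + FN)
    }


# Illustrative counts only, not results from the example dataset.
example = tusco_metrics(tp=40, fp=10, fn=6)
```

Under these formulas, sensitivity and false detection rate sum to 1, as do non-redundant precision and false discovery rate, which gives a quick consistency check on reported values.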
Transcripts are classified into four TUSCO categories:
| Category | Definition |
|---|---|
| TP (True Positive) | Exact structural match to a TUSCO reference transcript (FSM - Full Splice Match) |
| PTP (Partial True Positive) | Partial match to reference (ISM - Incomplete Splice Match, or NIC/NNC with shared junctions) |
| FN (False Negative) | TUSCO reference transcript not detected in the input |
| FP (False Positive) | Predicted transcript that does not match any TUSCO reference |
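The table can be read as a simple decision rule. The sketch below is a simplified reading of it, not the module's actual logic. The structural category strings are assumed to follow the values SQANTI3 writes to its classification file. `shares_reference_junctions` is a hypothetical flag standing in for the "shared junctions" condition, and which categories count as a detection for FN purposes is left to the caller.

```python
# Assumed SQANTI3 structural_category values.
FSM = "full-splice_match"
ISM = "incomplete-splice_match"
NIC = "novel_in_catalog"
NNC = "novel_not_in_catalog"


def tusco_category(structural_category, shares_reference_junctions=False):
    """Assign TP, PTP, or FP to one predicted transcript, following the table."""
    if structural_category == FSM:
        return "TP"
    if structural_category == ISM:
        return "PTP"
    if structural_category in (NIC, NNC) and shares_reference_junctions:
        return "PTP"
    return "FP"


def undetected_references(reference_ids, detected_ids):
    """Return reference entries with no detection; these are the FN cases."""
    return set(reference_ids) - set(detected_ids)
```

FN is handled separately because it describes reference transcripts that are absent from the input rather than a property of any predicted transcript.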
If conda fails to solve the environment:
- Use mamba instead of conda (faster solver)
- Remove specific version constraints if packages are unavailable for your platform
Some packages may have limited ARM64 support. The core TUSCO functionality works on Apple Silicon, but you may need to:
- Install packages without strict version pins
- Skip optional dependencies like `parasail` if they fail to build
If R packages fail to load, verify they are installed in the conda environment:
Rscript -e "library(Gviz); library(ggplot2); library(plotly); library(rmarkdown)"
If IGV plots are not generated:
- Check `logs/tusco_report.log` for errors
- Ensure the reference genome FASTA matches the annotation coordinates
- Verify that the Gviz R package is properly installed
- Genome: GRCh38.p14 (extracted regions)
- Annotation: GENCODE v49
- Input: WTC11 PacBio cDNA transcripts (subset)
Files use region-based chromosome names (e.g., chr1:1182237-1285041) to match the extracted genomic regions.
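When working with these files in custom scripts, a region-based name can be split back into its parts. This is a minimal sketch that assumes every sequence name follows the `chrom:start-end` pattern of the example above. `parse_region_name` is a hypothetical helper.

```python
import re

# Matches names such as "chr1:1182237-1285041".
REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region_name(name):
    """Split a region-based sequence name into (chrom, start, end)."""
    match = REGION_RE.match(name)
    if match is None:
        raise ValueError(f"not a region-based name: {name!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in {name!r}")
    return match["chrom"], start, end
```

For the example name, this returns `("chr1", 1182237, 1285041)`. Converting a position within a region back to a whole-genome coordinate additionally depends on how the regions were extracted (for instance, whether the start is 1-based), which is not stated here.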
- SQANTI3 Documentation
- TUSCO Preprint - Liu et al., bioRxiv 2025