Running SQANTI-reads

You can find out more about SQANTI-reads here: Keil N, Monzó C, McIntyre L, Conesa A (2025). Quality assessment of long read data in multisample lrRNA-seq experiments with SQANTI-reads. Genome Res. DOI: 10.1101/gr.280021.124

SQANTI-reads builds on SQANTI3, a tool for assessing the quality of transcript models, to provide a quality control protocol for replicated long-read RNA-seq experiments. It compiles the number and distribution of reads, and of unique junction chains (UJCs, i.e. transcript splicing patterns), across SQANTI3 structural categories. Multi-sample visualizations of QC metrics can also be separated by experimental design factors. SQANTI-reads also introduces new metrics for 1) the identification of potentially under-annotated genes and putative novel transcripts and 2) variation in junction donors and acceptors.

Modes of Operation

SQANTI-reads has two main modes of operation:

From Scratch (Raw Data): You provide raw FASTQ or GTF files. SQANTI-reads will run the SQANTI3 QC pipeline for you (mapping if necessary, then classification) and then aggregate the results.
Using Existing Output (Fast Mode): You have already run SQANTI3 QC on your samples. SQANTI-reads will read the existing outputs and aggregate them. This is recommended if you want to parallelize the QC step externally or have already processed your data.

Input Requirements

1. Design File (`--design` / `-de`)

A CSV file defining your experiment. It must contain at least two columns:

sampleID: The name you want to use for the sample in the output plots and tables (e.g., wtc11_rep1).
file_acc: The file accessor or prefix used to locate your files.
- From Scratch: The prefix of your FASTQ/GTF files (e.g., if the file is sample_A_pass.fastq, file_acc could be sample_A).
- Existing Output: The name of the directory containing the SQANTI3 output for that sample.

You can add extra columns (e.g., condition, replicate) to use for plot faceting (--factor).

Example design.csv:

sampleID,file_acc,condition
wtc11_cDNA,ENCFF105WIJ,cDNA
wtc11_CapTrap,ENCFF023EXJ,CapTrap
wtc11_R2C2,ENCFF063ASB,R2C2
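
Before launching a run, it can help to confirm that the design file has the two required columns. The following is a minimal sketch and not part of SQANTI-reads: check_design is a hypothetical helper, and treating every additional column as a candidate for --factor is an assumption based on the description above.

```python
import csv
import io

# The two columns the design file must contain, per the description above.
REQUIRED_COLUMNS = ("sampleID", "file_acc")


def check_design(handle):
    """Parse a design CSV and return (rows, extra_columns).

    Hypothetical helper: raises ValueError if a required column is missing.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("design file is missing column(s): " + ", ".join(missing))
    # Assumption: any further column (e.g. condition) can be used with --factor.
    extra_columns = [col for col in header if col not in REQUIRED_COLUMNS]
    return list(reader), extra_columns


EXAMPLE_DESIGN = """sampleID,file_acc,condition
wtc11_cDNA,ENCFF105WIJ,cDNA
wtc11_CapTrap,ENCFF023EXJ,CapTrap
wtc11_R2C2,ENCFF063ASB,R2C2
"""

rows, extra_columns = check_design(io.StringIO(EXAMPLE_DESIGN))
print(len(rows), "samples; columns usable for faceting:", extra_columns)
```

For a file on disk, the same function can be called as check_design(open("design.csv", newline="")).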

2. Reference Files

Reference Genome: FASTA format (--refFasta).
Reference Annotation: GTF format (--refGTF).

Detailed Usage

Mode A: Running from Scratch (Raw Data)

Use this mode if you have FASTQ or GTF files and want SQANTI-reads to handle the processing.

Command:

python sqanti3_reads.py \
    --design design.csv \
    --raw_data_dir ./data/ \
    --refFasta hg38.fa \
    --refGTF hg38.ensGene.gtf \
    --output ./results/

--raw_data_dir / -i: Directory containing your input files.
Input matching: The tool looks for files in raw_data_dir matching {file_acc}*.fastq or {file_acc}*.g*f.
Workflow:
- If FASTQ found: Maps reads (minimap2/uLTRA) -> Converts to GTF -> Runs SQANTI3 QC.
- If GTF found: Runs SQANTI3 QC.
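
To preview which files a given file_acc would pick up, the two patterns above can be reproduced with Python's glob module. This is only a sketch of the matching rule as described here, not the tool's actual code; find_inputs is a hypothetical name, and the demonstration uses empty placeholder files in a temporary directory.

```python
import glob
import os
import tempfile


def find_inputs(raw_data_dir, file_acc):
    """Return (fastq_files, annotation_files) matching the two documented patterns."""
    base = glob.escape(raw_data_dir)  # protect special characters in the directory name
    fastq = sorted(glob.glob(os.path.join(base, file_acc + "*.fastq")))
    annot = sorted(glob.glob(os.path.join(base, file_acc + "*.g*f")))
    return fastq, annot


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_A_pass.fastq", "sample_B.gtf", "other.fastq"):
        open(os.path.join(tmp, name), "w").close()
    fastq_a, annot_a = find_inputs(tmp, "sample_A")
    fastq_b, annot_b = find_inputs(tmp, "sample_B")
    print([os.path.basename(p) for p in fastq_a])
    print([os.path.basename(p) for p in annot_b])
```

Because the pattern is a prefix match, a file_acc that is itself a prefix of another sample's file name (for example sample_A and sample_AB) would match both files under this rule, so distinct, non-overlapping prefixes are the safer choice.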

Mode B: Using Existing SQANTI3 Output (Fast Mode)

Use this mode if you have already generated SQANTI3 output files. This is faster and allows you to run the computationally intensive QC step in parallel (e.g., on a cluster) before running SQANTI-reads.

Command:

python sqanti3_reads.py \
    --design design.csv \
    --sqanti_dirs ./sqanti3_outputs/ \
    --refFasta hg38.fa \
    --refGTF hg38.ensGene.gtf \
    --output ./results/

--sqanti_dirs / -d: Parent directory containing the subdirectories of your SQANTI3 runs.
Expected Structure: Inside sqanti_dirs, there should be a folder named {file_acc} for each sample. Inside that folder, the classification file should be named {sampleID}_classification.txt.
- Example: ./sqanti3_outputs/ENCFF105WIJ/wtc11_cDNA_classification.txt
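
The expected layout can be checked before starting a Fast Mode run. The sketch below builds the path {sqanti_dirs}/{file_acc}/{sampleID}_classification.txt for each sample and reports the ones that are absent; missing_classifications is a hypothetical helper, and it checks only the classification file because that is the only file name specified above.

```python
import os
import tempfile


def missing_classifications(sqanti_dirs, samples):
    """Return the expected classification paths that do not exist.

    samples is an iterable of (sampleID, file_acc) pairs taken from the design file.
    """
    missing = []
    for sample_id, file_acc in samples:
        path = os.path.join(sqanti_dirs, file_acc, sample_id + "_classification.txt")
        if not os.path.isfile(path):
            missing.append(path)
    return missing


# Demonstration: one sample has its output in place, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "ENCFF105WIJ"))
    open(os.path.join(tmp, "ENCFF105WIJ", "wtc11_cDNA_classification.txt"), "w").close()
    samples = [("wtc11_cDNA", "ENCFF105WIJ"), ("wtc11_CapTrap", "ENCFF023EXJ")]
    missing = missing_classifications(tmp, samples)
    print([os.path.relpath(p, tmp) for p in missing])
```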

Arguments Summary

sqanti3_reads.py [-h] --refFasta REFFASTA --refGTF REFGTF --design INDESIGN [-i INPUT_DIR] [-p PREFIX] [-d SQANTI_DIRS] [-o OUTPUT]
                        [--report {pdf,html,both}] [--all_tables] [--pca_tables] [--min_ref_len MIN_REF_LEN] [-ge ANNOTEXP] [-je JXNEXP]
                        [-pc PERCCOV] [-pj PERCMAXJXN] [--aligner_choice {minimap2,uLTRA}] [-s SITES] [--skip_hash] [-f INFACTOR]
                        [-fl FACTORLVL] [-t CPUS] [-n CHUNKS] [--force_id_ignore] [--verbose] [-v]

Required

--refFasta: Reference genome file (FASTA).
--refGTF: Reference annotation file (GTF).
--design, -de: Path to the design CSV file.

Input / Output

--raw_data_dir, -i: Directory with raw FASTQ/GTF files (for Mode A). Default: ./
--sqanti_dirs, -d: Directory with existing SQANTI3 output folders (for Mode B).
--output, -o: Output directory for results. Default: ./
--prefix, -p: Prefix for output files. Default: sqantiReads
--report: Report format (pdf, html, both). Default: pdf
--all_tables: Export all intermediate tables.
--pca_tables: Export PCA analysis tables.

Analysis & Filtering

--aligner_choice: Aligner to use if mapping FASTQ (minimap2 or uLTRA). Default: minimap2
--cpus, -t: Number of threads. Default: 10
--chunks, -n: Number of chunks for parallelizing SQANTI3.
--min_ref_len: Minimum reference transcript length.
--gene_expression, -ge: Expression cutoff for under-annotated genes. Default: 100
--jxn_expression, -je: Coverage threshold for splice sites. Default: 10
--perc_coverage, -pc: % gene coverage for well-covered unannotated transcripts. Default: 20
--perc_junctions, -pj: % of max junctions for near FL transcripts. Default: 80

Visualization

--factor, -f: Column name in the design file to facet plots by (e.g., condition).
--factor_level, -fl: Specific factor level to evaluate.
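
As an illustration of how these options combine, the sketch below assembles a Mode B command line that facets plots by the condition column. It uses only flags listed on this page; build_fast_mode_command is a hypothetical wrapper, and the file names are the placeholders from the Mode B example.

```python
import shlex


def build_fast_mode_command(design, sqanti_dirs, ref_fasta, ref_gtf, output,
                            factor=None, report=None, cpus=None):
    """Assemble a Mode B command line as an argument list."""
    cmd = [
        "python", "sqanti3_reads.py",
        "--design", design,
        "--sqanti_dirs", sqanti_dirs,
        "--refFasta", ref_fasta,
        "--refGTF", ref_gtf,
        "--output", output,
    ]
    # Optional flags are appended only when set, so the tool's defaults apply otherwise.
    if factor is not None:
        cmd += ["--factor", factor]
    if report is not None:
        cmd += ["--report", report]
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]
    return cmd


cmd = build_fast_mode_command(
    "design.csv", "./sqanti3_outputs/", "hg38.fa", "hg38.ensGene.gtf", "./results/",
    factor="condition", report="both", cpus=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains sqanti3_reads.py. The value given to factor must be a column name present in the design file.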

Output Files

Output files are generated in the directory specified by --output.

Main Results (CSV)

{prefix}_gene_counts.csv: Read counts per gene, per sample.
{prefix}_ujc_counts.csv: Read counts per Unique Junction Chain (UJC), per sample.
{prefix}_length_summary.csv: Read length statistics.
{prefix}_cv.csv: Coefficient of variation metrics.
{prefix}_gene_classification.csv: Gene annotation categories (for genes meeting coverage thresholds).
{prefix}_putative_novel_transcripts.csv: Metrics on NIC/NNC transcripts.
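
The file names above follow directly from --output and --prefix, so a downstream script can derive them instead of hard-coding paths. A small sketch, assuming the documented defaults (./ and sqantiReads); main_result_paths is a hypothetical helper, and whether every file is written in every run is not stated here.

```python
import os

# Suffixes of the main result tables listed above.
MAIN_RESULT_SUFFIXES = (
    "gene_counts",
    "ujc_counts",
    "length_summary",
    "cv",
    "gene_classification",
    "putative_novel_transcripts",
)


def main_result_paths(output_dir="./", prefix="sqantiReads"):
    """Return the expected paths of the main CSV results for a run."""
    return [
        os.path.join(output_dir, f"{prefix}_{suffix}.csv")
        for suffix in MAIN_RESULT_SUFFIXES
    ]


paths = main_result_paths("./results/")
for path in paths:
    print(path, "(found)" if os.path.isfile(path) else "(not found)")
```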

Modified Input

processed_{design_file}.csv: Updated design file including paths to generated classification/junction files.
Sample-specific folders containing _reads_classification.txt (original classification + unique junction chain info).

Plots

{prefix}_plots.pdf: Comprehensive QC metrics visualization.
{prefix}_annotation_plots.pdf: Under-annotation metrics visualization.

Running SQANTI-reads with the example data

Mode A (Raw Data)

python sqanti3_reads.py \
    --raw_data_dir ./example/sqanti_reads_test \
    --output ./example/sqanti_reads_test/new_output \
    --design ./example/sqanti_reads_test/sqR_design_file.csv \
    --refGTF test/test_data/reference/test_reference.gtf \
    --refFasta test/test_data/genome/genome_test.fasta

Mode B (Fast mode)

python sqanti3_reads.py \
    --sqanti_dirs ./example/sqanti_reads_test/results/ \
    --output ./example/sqanti_reads_test/fast_mode_output \
    --design ./example/sqanti_reads_test/sqR_design_file.csv \
    --refGTF test/test_data/reference/test_reference.gtf \
    --refFasta test/test_data/genome/genome_test.fasta

Citing SQANTI-reads

If you are using SQANTI-reads, please cite: Keil N, Monzó C, McIntyre L, Conesa A (2025). Quality assessment of long read data in multisample lrRNA-seq experiments with SQANTI-reads. Genome Research, 35 (4), 987. DOI: 10.1101/gr.280021.124

SQANTI-reads is based on and uses SQANTI3, so please also cite: Pardo-Palacios FJ, Arzalluz-Luque A, Kondratova L, Salguero P, Mestre-Tomás J, Amorín R, Estevan-Morió E, Liu T, Nanni A, McIntyre L, Tseng E, Conesa A (2024). SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. Nature Methods, 21, 793-797. DOI: 10.1038/s41592-024-02229-2