Tutorial: running SQANTI3 on an example dataset - ConesaLab/SQANTI3 GitHub Wiki

Video tutorial

Step-by-Step tutorial

The document is structured to guide users through various aspects of the SQANTI3 tool.

  1. SQANTI3 Installation: This section guides users through installing the SQANTI3 tool, covering prerequisites, dependencies, and step-by-step instructions for a successful installation.
  2. Example Dataset Overview: Before diving into the execution of SQANTI3, users are introduced to a sample dataset. It contains the input files needed to run the software, together with directories holding the output of the different SQANTI3 stages: quality control, the machine learning and rules-based filters, and rescue. In addition to the genome and annotation files for chromosome 22, the dataset provides paired-end sequencing read files for two replicates of the Universal Human Reference (UHR) sample and scripts demonstrating how to run the different SQANTI3 modules on the provided data. This gives users a clear picture of the expected data structure and of the processes involved in SQANTI3 operations.
  3. Executing and Understanding SQANTI3 QC: Here, users will learn how to run the quality control (QC) component of SQANTI3 which focuses on characterizing transcriptomes generated from lrRNA-Seq data, comparing them to reference annotations, and leveraging various datasets including short-read sequencing, CAGE, and Quant-Seq for a comprehensive evaluation. The guide will delve into the details of FSM and ISM subcategories, the integration of TSS and TTS evidence, and the internal handling of short-read data using tools like STAR and Kallisto.
  4. Executing and Understanding SQANTI3 Filter: This segment introduces the filtering capabilities of SQANTI3. Users will be guided on how to utilize both the machine learning-based and rules-based methods to efficiently differentiate between genuine isoforms and potential artifacts, ensuring a high-quality transcriptome tailored to their research needs.
  5. Executing and Understanding SQANTI3 Rescue: This final section will focus on the 'rescue' functionality of SQANTI3. Users will learn the process of executing this feature and, just as importantly, understand the results and the rationale behind 'rescuing'. The SQANTI3 rescue algorithm aims to retrieve potentially discarded transcripts during long-read data processing by identifying appropriate reference transcripts, ensuring they meet quality criteria and are non-redundant, and then incorporating them into the final transcriptome.


1. Installation

1.1 Dependencies

  • General: Perl, Minimap2, Python (3.7), R (>= 3.4.0), kallisto, samtools, STAR, uLTRA, deSALT, pip
  • Python Libraries: bx-python, BioPython, BCBioGFF, cython, NumPy, pysam, pybedtools, psutil, pandas, scipy
  • R Libraries: R packages for sqanti3_qc.py and sqanti3_filter.py (installed with conda environment)

1.2 Installation

1.2.1. Anaconda: Install/update Anaconda. Add to PATH and update if needed:

export PATH=$HOME/anacondaPy37/bin:$PATH
conda -V
conda update conda

1.2.2 SQANTI3: Download SQANTI3 v5.4 and extract:

wget https://github.com/ConesaLab/SQANTI3/archive/refs/tags/v5.4.tar.gz
tar -xvf v5.4.tar.gz
cd SQANTI3-5.4

1.2.3 Conda Environment: Create and activate environment:

conda env create -f SQANTI3.conda_env.yml
source activate sqanti3
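
After activating the environment, a quick smoke test is to print the help text of the main scripts. This only checks that the scripts and their Python dependencies resolve correctly (assuming the steps above completed without errors):

python sqanti3_qc.py --help
python sqanti3_filter.py --help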

2. Example Dataset Overview

This section provides a brief overview of the example dataset used throughout this tutorial. The dataset consists of various files and directories representing the output from different stages of the SQANTI3 pipeline, together with the input files required to run SQANTI3.

2.1 Output directories

  • SQANTI3_QC_output: Contains output files generated by the SQANTI3 Quality Control (QC) process.
  • ml_filter_output: This directory contains output files from the machine learning filter process in SQANTI3.
  • rules_filter_output: Houses output files from the rules-based filtering process in SQANTI3.
  • rescue_automatic: This directory contains the output files generated from running SQANTI3 rescue in automatic mode.
  • rescue_rules: In this directory, rescue has been run in full mode using the results from the rules filter.
  • rescue_ml: This directory contains the output files generated from running SQANTI3 rescue in machine learning mode. It used the same random forest model that was generated during the machine learning filter process.

2.2 Input files

  • Genome and Annotation Files: They are stored in the reference directory.

    • GRCh38.p13_chr22.fasta: A FASTA file containing the sequence of chromosome 22 from the GRCh38.p13 version of the human genome.
    • GRCh38.p13_chr22.fasta.fai: Index file for the FASTA file, facilitating quick access to different regions within the file.
    • Homo_sapiens_GRCh38_Ensembl_86.chr22.gff3: A GFF3 file containing an annotation for chromosome 22 from the Ensembl 86 release of the human genome.
    • gencode.v38.basic_chr22.gtf: A GTF file containing GENCODE basic gene annotation for chromosome 22 from the GENCODE v38 release.
  • Sample Data:

    • UHR_chr22.gtf: A GTF file containing transcript annotations for the UHR sample, focused on chromosome 22.
    • UHR_abundance.tsv: A tab-separated file containing transcript abundance estimates for the UHR sample.
    • UHR_chr22_short_reads.fofn: A file of file names (FOFN) listing short-read data files related to the UHR sample on chromosome 22 (an illustrative layout is shown after this list).
      • UHR_Rep1_chr22.R1.fastq.gz and UHR_Rep1_chr22.R2.fastq.gz: Paired-end sequencing read files for replicate 1 of a Universal Human Reference (UHR) sample, focused on chromosome 22.
      • UHR_Rep2_chr22.R1.fastq.gz and UHR_Rep2_chr22.R2.fastq.gz: Paired-end sequencing read files for replicate 2 of a UHR sample, focused on chromosome 22.
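
As an illustration of the FOFN layout referenced above, the file can be inspected with cat. The exact contents shipped with the example data may differ; the one-replicate-per-line, space-separated pairing shown in the comments is an assumption about the expected format:

cat example/UHR_chr22_short_reads.fofn
# Illustrative layout: one replicate per line, R1 and R2 separated by a space
# UHR_Rep1_chr22.R1.fastq.gz UHR_Rep1_chr22.R2.fastq.gz
# UHR_Rep2_chr22.R1.fastq.gz UHR_Rep2_chr22.R2.fastq.gz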

2.3 Example scripts and config files:

  • Scripts:

    • run_SQANTI3_QC.sh: A shell script demonstrating how to run the SQANTI3 Quality Control (QC) process on the example data.
    • run_SQANTI3_MLfilter.sh: A shell script demonstrating how to run the SQANTI3 machine learning filter.
    • run_SQANTI3_rules_filter.sh: A shell script demonstrating how to run the SQANTI3 rules-based filtering process on the example data.
  • Config files: YAML-format configuration files located under the examples/config_files directory, intended to be used with the SQANTI3 wrapper.

3. Executing and Understanding SQANTI3 QC (Video for this part)

SQANTI3 is the updated iteration of the SQANTI tool designed for quality control and characterization of transcriptomes built from lrRNA-Seq data. Central to SQANTI3 is its Quality Control (QC) module, which evaluates the de novo transcriptome against a reference annotation and outputs a classification file. This classification not only delves deep into transcript variations like Full Splice Matches (FSM) and Incomplete Splice Matches (ISM) but also introduces new subcategories for them. Additionally, SQANTI3 QC integrates evidence from both short and long-read data, even accepting orthogonal data types like CAGE and Quant-Seq, to support and validate transcript start and end sites. The software has also been enhanced to handle short-read data internally, incorporating tools like STAR and Kallisto for mapping and quantification.

3.1 Running SQANTI3 QC with example command:

The provided command runs the sqanti3_qc.py script which is designed for structural and quality annotation of novel transcript isoforms.

3.1.1 Example script

python sqanti3_qc.py --isoforms example/UHR_chr22.gtf \
                     --refGTF example/gencode.v38.basic_chr22.gtf \
                     --refFasta example/GRCh38.p13_chr22.fasta    \
                     --CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed    \
                     --polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt    \
                     -o UHR_chr22 -d example/SQANTI3_output -fl example/UHR_abundance.tsv    \
                     --short_reads example/UHR_chr22_short_reads.fofn --cpus 4 --report both

3.1.2 Required Arguments:

  1. --isoforms example/UHR_chr22.gtf:
    • Input isoforms file in GTF format. This could be given in FASTA format as well.
  2. --refGTF example/gencode.v38.basic_chr22.gtf:
    • Reference annotation file in GTF format.
  3. --refFasta example/GRCh38.p13_chr22.fasta:
    • Reference genome file in FASTA format.

3.1.3 Optional Arguments:

  • --CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed:
    • Specifies a BED format file with FANTOM5 CAGE Peak data for annotating transcription start sites (TSS).
  • --polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt:
    • Provides a ranked list of polyA motifs that are sequences often found near polyadenylation sites.
  • -o UHR_chr22:
    • Sets the prefix for the output files to UHR_chr22.
  • -d example/SQANTI3_output:
    • Designates the directory where output files will be saved: example/SQANTI3_output.
  • -fl example/UHR_abundance.tsv:
    • Specifies a file containing full-length PacBio abundance data.
  • --short_reads example/UHR_chr22_short_reads.fofn:
    • Provides a File Of File Names (fofn) with paths to FASTA or FASTQ files from short-read RNA-Seq.
  • --cpus 4:
    • Sets the number of threads (CPU cores) used by the aligners to 4.
  • --report both:
    • Specifies the report format, producing outputs in both HTML and PDF formats.

Note: The backslashes (\) at the end of lines in the command are used to split a lengthy command into multiple readable lines. Each backslash indicates the continuation of the command on the next line.

3.2 Understanding SQANTI3 QC:

SQANTI3 QC produces a series of output files following its execution, using flags like --dir or -d and --output or -o to specify the directory and file prefix, respectively. This tool also relies on other software, including STAR, kallisto, and GMST, which means there will be subfolders in the output directory. For an example output, refer to the example/SQANTI3_QC_output subfolder in the main SQANTI3 directory.

3.2.1 SQANTI3 transcriptome classification (Video link for this part)

SQANTI3 classifies each isoform by comparing it to a known reference transcript. Here's how it categorizes them:

FSM (Full Splice Match) Subcategories:

The reference and query isoform both have the same exon count, with each internal junction aligning to the reference. However, the exact start (5') and end (3') can vary within the first and last exons.

  • Reference Match: The query isoform matches the reference at all internal junctions, and both its 5' start and 3' end lie within 50 bp of the reference TSS and TTS.
  • 5' Mismatch: The internal junctions match, but the 5' start differs from the reference TSS by more than 50 bp.
  • 3' Mismatch: The internal junctions match, but the 3' end differs from the reference TTS by more than 50 bp.
  • 5' and 3' Mismatch: The internal junctions match, but both the 5' start and 3' end differ from the reference by more than 50 bp.
ISM (Incomplete Splice Match)

The query isoform has fewer outer exons compared to the reference, but all internal junctions align with the reference's positions. The exact start (5') and end (3') can vary within the initial/final exons.

  • 5' fragments: The query isoform matches the 5' portion of the reference but is missing exons at the 3' end.
  • 3' fragments: The query isoform matches the 3' portion of the reference but is missing exons at the 5' end.
  • Internal fragments: The query isoform is missing exons at both the 5' and 3' ends but matches a consecutive stretch of internal junctions.
  • Intron retention: The query isoform contains an intron that is usually spliced out in the reference.
NIC (Novel In Catalog) Subcategories & NNC (Novel Not in Catalog) Subcategories:

NIC (Novel In Catalog): The term "Novel In Catalog" (NIC) refers to a subtype of novel isoforms that are characterized by their utilization of known splice junctions. While the junctions, which are pairs of donor-acceptor sites, are previously identified and cataloged, the specific combinations in which these junctions are assembled in the isoform are novel. Essentially, NIC represents isoforms that employ previously recognized junctions in a unique or previously undocumented manner.

NNC (Novel Not in Catalog): The "Novel Not in Catalog" (NNC) denotes a subtype of novel isoforms where at least one of the splice sites, either donor or acceptor, is entirely novel and has not been cataloged previously. Even if the other splice sites involved are known, the presence of a novel splice site means that the isoform falls under the NNC category. This subtype signifies isoforms that introduce new splicing patterns to the catalog due to the presence of previously unidentified splice sites.

  • NIC Combination of known SJ: In this category, the query isoform is formed by combining known splice junctions. In essence, even though the particular arrangement of splice junctions in the isoform is novel, all the individual splice junctions themselves have been previously cataloged.

  • NIC Combination of known splice site: This refers to a query isoform composed of known splice sites. Splice sites are specific nucleotide sequences at the ends of introns that signal the splicing machinery. A splice site can be either a donor (at the start of an intron) or an acceptor (at the end of an intron). Even if the precise configuration or combination in the isoform hasn't been seen before, the individual splice sites have been documented previously.

  • NIC Intron Retention: This scenario involves a query NIC isoform that retains an intron that is usually spliced out in other known isoforms.

  • NIC Mono-exon: This scenario involves a NIC isoform that consists of a single exon without any introns.
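
A quick way to see how the example transcriptome distributes across these categories is to tabulate the structural_category column of the QC classification file. The snippet below is a sketch: the file name assumes the UHR_chr22 output prefix used in this tutorial, and the column is located by its header name rather than by position:

# Count isoforms per structural category in the QC classification output
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="structural_category") c=i; next} {print $c}' \
    example/SQANTI3_QC_output/UHR_chr22_classification.txt | sort | uniq -c | sort -rn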

3.2.2 TSS (Transcription Start Site) Evaluation in SQANTI3 (Video link for this part)

Accurate identification of Transcription Start Sites (TSS), the specific positions on the DNA where transcription of a gene begins, is essential for evaluating transcript models. SQANTI3 assesses TSS support using several sources of evidence:

Distance from the Reference TSS:
  • First, it assesses how far a detected TSS falls from the reference TSS.
  • The proximity of your TSS to the reference can indicate its accuracy. If it's too far off, it may be a sign of an inaccurate TSS detection.
Using CAGE Peaks Data:
  • Cap Analysis Gene Expression (CAGE) is a technique to identify the TSS of genes. SQANTI3 uses CAGE peak data to help distinguish between true TSS and potential false positives. If the detected TSS aligns with a CAGE peak, it is more likely to be a true TSS; if it does not align with any CAGE peak, it might be a false detection.
Short-read TSS ratio:
  • The TSS ratio is computed for each transcript as the ratio between short-read coverage downstream and upstream of the TSS. A true TSS is expected to have much lower upstream coverage, resulting in a TSS ratio greater than one. In contrast, 5' end-degraded transcripts are expected to have uniform coverage on both sides of the TSS, leading to a TSS ratio of approximately one.
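
As a concrete illustration of the TSS ratio metric, the classification file reports it per isoform; isoforms whose ratio is close to or below one are candidates for 5' degradation. The column name ratio_TSS and the file name are assumptions based on the tutorial's output prefix, so check the header of your own file:

# List isoforms whose short-read TSS ratio is <= 1 (possible 5'-degraded models)
# NOTE: the column name "ratio_TSS" is an assumption; verify it in your file's header
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="ratio_TSS") c=i; next} $c!="NA" && $c<=1 {print $1, $c}' \
    example/SQANTI3_QC_output/UHR_chr22_classification.txt | head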

3.2.3 SQANTI3 Transcriptional termination site quality control (Video link for this part)

SQANTI3 uses three quality control metrics to evaluate transcription termination sites (TTS) in a transcriptome:

  1. Distance to Annotated TTS: Measures the gap between a transcript's TTS and the nearest known TTS from reference databases to spot novel or existing termination sites.
  2. Polyadenylation (polyA) Motif Detection: Scans for a specific sequence near the 3' end of transcripts that signals for polyadenylation, serving as a biological marker for genuine TTS.
  3. Quant-seq Data Support: If available, this data provides additional validation for the 3' ends, enhancing confidence in the identified TTS.
Each of these metrics in more detail:

Distance to Annotated TTS:

  • What: SQANTI3 calculates the physical distance between the TTS of the newly sequenced transcript and the closest annotated TTS belonging to the same gene in reference databases.
  • Why: By doing this, SQANTI3 can evaluate if a transcript's TTS is novel or if it matches (or is proximate to) a previously known termination site. If a transcript's TTS is very distant from any annotated TTS, it may be a novel site or an artifact, and further validation may be needed.

Polyadenylation (polyA) Motif Detection:

  • What: Transcripts typically have a sequence motif near their 3' end that signals for polyadenylation (addition of a polyA tail). SQANTI3 scans the transcript sequence to detect this motif.
  • Why: The presence of a polyA motif can indicate a genuine TTS, as the motif is a biological marker for transcript termination and subsequent addition of a polyA tail. These detected motifs are commonly found 16-18 base pairs from the end of the transcript, consistent with known experimental evidence. This adds an extra layer of validation. Absence or unusual positioning of this motif might suggest the 3' end of the transcript was inaccurately determined or that alternative polyadenylation sites exist.

Quant-seq Data Support (polyA site data):

  • What: Quant-seq is a method to sequence the 3' ends of RNAs and is used to validate the 3' ends obtained from other sequencing methods. If Quant-seq data is available, SQANTI3 considers it as orthogonal evidence to validate the TTS of a transcript.
  • Why: Quant-seq specifically targets the 3' end of transcripts, making it a reliable method to validate TTS. If both long-read sequencing (like from PacBio) and Quant-seq point to the same TTS, confidence in that TTS's accuracy increases.

3.2.4 SQANTI3 Splice junction classification (Video link for this part)

When it comes to splice junctions, here's how SQANTI approaches the task:

Transcript Classification Based on Splice Junctions:

SQANTI classifies transcripts by comparing their splice junctions to those in a provided reference transcriptome.

  • Full Splice Match (FSM): Transcripts that match a reference transcript at all splice junctions.
  • Incomplete Splice Match (ISM): Transcripts that match consecutive, but not all, splice junctions of the reference transcripts. Those matching mostly the UTR3 sequence of their reference are labeled UTR3 Fragment.
  • Novel Transcripts: Transcripts introducing new splice patterns are divided into:
    • Novel in Catalog (NIC): Contains either new combinations of already annotated splice junctions or novel ones formed from previously annotated donors and acceptors.
    • Novel Not in Catalog (NNC): Utilizes novel donors and/or acceptors.
Canonical vs Noncanonical Splicing:

Splice junctions are categorized into canonical and noncanonical based on the dinucleotide pairs at their start and end.

  • Canonical splicing includes the combinations GT-AG, GC-AG, and AT-AC, found in over 99.9% of human introns.
  • All other combinations are considered noncanonical splicing.
  • Users can also supply their own definitions of canonical junctions.
Splice Junction Coverage with Short Reads:

SQANTI3 incorporates short-read data by internally running STAR and Kallisto for mapping and quantification of splice junctions. The combination of long and short reads provides a dual layer of evidence: if a novel splice junction is detected by both read types, there is increased confidence in its authenticity. Unlike SQANTI, SQANTI3 can process raw Illumina FASTQ files directly, producing short read-related metrics. Here are the details of how SQANTI3 processes short-read data:

  1. A genome index is prepared and short reads are mapped using STAR.
  2. The mapping does not use a reference annotation, ensuring that SJ identification is not influenced by previous annotations.
  3. STAR's parameters follow the ENCODE-DCC RNA-seq protocol. To enhance novel SJ detection, the --twopassMode option is turned on.
  4. Post STAR execution, two files are generated per replicate: SJ.out.tab, which provides quantification of long-read-defined SJs by short reads, and a BAM file, utilized for the novel TSS ratio metric computation.

3.2.5 SQANTI3 Intra-priming (Video link for this part)

Researchers use specific oligos, typically 20-35 bases long, called primers to initiate the copying or amplification of DNA. An 'oligo-dT' primer is designed to anneal to the long stretch of 'A's at the end of a transcript, known as the polyA tail. However, because transcripts also contain internal A-rich regions (not just at the end), the oligo-dT primer may inadvertently begin amplification from these internal regions instead. If such transcripts lack a consensus polyadenylation signal at their 3' end, SQANTI3 considers them likely products of internal priming and flags them. These transcripts can later be removed with the SQANTI3 filter module.

Mechanism of intra-priming:
  • Imagine your DNA is a long string of letters, with 'A' being one of them. Sometimes, there's a region in the DNA that has lots of 'A's, either 6 of them in a row or 12 out of a group of 20 letters.
  • When a small piece of DNA (or a primer) finds this A-rich area, it can stick to it and start copying the DNA from there.
  • This is called "internal priming" because it's happening inside the DNA sequence and not where it's supposed to be.

SQANTI3 checks for this by examining the genomic sequence immediately downstream of each transcript's 3' end and flagging transcripts with a very A-rich stretch (>= 80% A's in the 20 bp downstream window).
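
In practice, this information is stored in the perc_A_downstream_TTS column of the classification file, the same column used by the default rules filter shown later. A hedged sketch for listing likely intra-priming products, using the 60% threshold applied by the default filter and assuming the tutorial's UHR_chr22 output prefix:

# List isoforms with >= 60% adenines in the genomic window downstream of the TTS
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="perc_A_downstream_TTS") c=i; next} $c>=60 {print $1, $c}' \
    example/SQANTI3_QC_output/UHR_chr22_classification.txt | head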

3.2.6 SQANTI3 RT-switching (Video link for this part)

RT-switching refers to the "jumping" or switching of the reverse transcriptase enzyme between repeated regions of an RNA molecule due to its secondary structure. This can lead to the formation of cDNA molecules that don't accurately represent the original RNA sequence and can introduce artifacts when analyzing RNA sequences or structures.

Here's a step-by-step explanation of RT-switching:
  1. Reverse Transcription Start: The process begins with the initiation of reverse transcription. This is where the reverse transcriptase enzyme (RT) starts transcribing RNA into complementary DNA (cDNA). The RNA molecule has a secondary structure and repeated regions which play a role in RT-switching.
  2. RNA Secondary Structure and Repeated Regions: RNA molecules can form secondary structures due to complementary base pairing within the molecule itself. These structures may bring repeated regions into close proximity. When the RT enzyme encounters these structures, it might jump from one repeated region to another.
  3. Formation of cDNA with Skipped Repeats: When the RT enzyme switches or "jumps" between repeated regions, the resulting cDNA molecule may not have all the repeated regions that were originally present in the RNA. Instead, the cDNA will possess only one copy of the direct repeats.
  4. Novel Splice Junctions Appear: When the cDNA, which has undergone RT-switching, is sequenced or mapped, it might appear as though it has novel splice junctions. These are not true biological splice junctions but are artifacts caused by the RT-switching event.

SQANTI3 identifies these potential artifacts by checking for the presence of direct repeats in the sequence. This search is performed in a region of a user-defined number of bases around the splice junctions. If a matching pattern is found within the exon and intron sequences, the junction is flagged as an RT-switching artifact, and all isoforms containing this junction are classified as "RT-switching".
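
The result of this check is recorded per isoform in the RTS_stage column (TRUE/FALSE), which is also one of the columns used by the default rules filter in the next section. A quick tally, again assuming the tutorial's output prefix:

# Tally isoforms flagged (TRUE) vs not flagged (FALSE) as RT-switching artifacts
awk -F'\t' -v col="RTS_stage" 'NR==1 {for (i=1; i<=NF; i++) if ($i==col) c=i; next} {n[$c]++} END {for (v in n) print v, n[v]}' \
    example/SQANTI3_QC_output/UHR_chr22_classification.txt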

4. Executing and Understanding SQANTI3 Filter (Video link for this part)

SQANTI3 has enhanced its filtering capabilities by improving its machine learning-based filter and introducing a flexible rules-based strategy, allowing users to discern between true isoforms and artifacts more accurately.

4.1 SQANTI3 rules filter (Video link for this part)

SQANTI3 has a rules-based strategy for filtering, which is especially useful when defining TP and TN sets is challenging. Here, a JSON file is used to specify the characteristics of a reliable isoform. The file is comprised of rules, with each rule consisting of multiple requisites. All requisites of a rule must be met for a transcript to be deemed true. Multiple rules for the same category are evaluated independently, and a transcript only needs to pass one of these rules. SQANTI3 comes with a default JSON file that includes rules for a basic filter. However, users can create their own JSON files to define custom rules.

Default rules filter
{
    "full-splice_match": [
        {
            "perc_A_downstream_TTS":[0,59]
        }
    ],
    "rest": [
        {
            "perc_A_downstream_TTS":[0,59],
            "all_canonical":"canonical",
            "RTS_stage":"FALSE"
        },
        {   
            "perc_A_downstream_TTS":[0,59],
            "RTS_stage":"FALSE",
            "min_cov":3
        }
     ]
}

4.1.1 Rules filter example:

To filter a classification file and generate outputs with the prefix "filtered_results", you can use:

python sqanti3_filter.py rules --sqanti_class path/to/classification.txt -o filtered_results

Note: this will apply default filters and produce filtered files with the "filtered_results" prefix in the directory where the script is run.

Required Arguments:
  • --sqanti_class: Path to the SQANTI3 QC classification file.
Optional Arguments:
  • -h, --help: Displays help message and exits.
  • --isoAnnotGFF3 ISOANNOTGFF3: Path to the isoAnnotLite GFF3 file that is to be filtered.
  • --isoforms ISOFORMS: Path to the fasta/fastq isoform file that is to be filtered.
  • --gtf GTF: Path to the GTF file that is to be filtered.
  • --sam SAM: Path to the SAM alignment of the input fasta/fastq.
  • --faa FAA: Path to the ORF prediction faa file to be filtered by SQANTI3.
  • -o OUTPUT, --output OUTPUT: Specifies the prefix for output files.
  • -d DIR, --dir DIR: Sets the directory for output files. Default is the directory where the script was run.
  • -v, --version: Displays the program version number.
  • --skip_report: When supplied, the filter will not generate a report.
  • -j JSON_FILTER, --json_filter JSON_FILTER: Specifies the JSON file where filtering rules are expressed. By default, the filter uses utilities/filter/filter_default.json.

4.1.4 Understand how to create SQANTI3 rules

Creating Custom Filters:
  • Starting with SQANTI3 version 5.0, users can specify any number of filtering rules in a JSON file.
  • The structure involves defining rules for each structural category, with each rule made up of various requisites.
    • Rules for a category are treated as logical OR.
    • Requisites within a rule are treated as logical AND.
  • The "rest" category can be used to specify rules for all other unspecified categories.

Rule Structure in JSON:
  • Rules can be set for any numeric or character column in the classification file:
    • Numeric columns: Define an interval ([X,Y]) or just the lower limit.
    • Character/logical columns: Define which terms or values are accepted.
Example of User-defined Filter (Video link for this part):
  1. Full Splice Match category:

    • Isoforms must not be a potential intra-priming product: they should have less than 60 percent of genomic adenines in the 20 bp window downstream of the TTS.
  2. Incomplete Splice Match category:

    • Isoforms must be longer than two kilobases and shorter than 15 kilobases.
    • They must belong to the 3' fragment, 5' fragment, or internal fragment subcategories.
  3. Novel In the Catalog category:

    • All splice junctions should be canonical or covered by at least 10 short reads.
  4. Novel Not in the Catalog category:

    • All splice junctions should be canonical.
    • The absolute distance to the closest annotated transcription start site and transcription termination site must be 50 bp or less, or the junctions must be covered by at least 10 short reads.
  5. Rest of the Categories:

    • Transcripts will be kept if:
      • They are not suspected of being an RT switching artifact.
      • All of their splice junctions are canonical.
      • The transcript is coding.
      • It is not an intra-priming product.
      • It is not a mono exon transcript.

An example JSON filter is provided that defines rules for different isoform categories, including "full-splice_match", "incomplete-splice_match", "novel_in_catalog", "novel_not_in_catalog", and "rest". This filter specifies various conditions for keeping isoforms, such as avoiding potential intrapriming products, setting length thresholds, and requiring canonical splice junctions.
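
To make the structure concrete, here is a hedged sketch of how two of the rules above could be written and applied. Only columns that already appear in the default filter (perc_A_downstream_TTS, all_canonical, min_cov) are used; the length, subcategory, and TSS/TTS-distance requisites from the full example would need the corresponding classification column names, which are omitted here to avoid guessing them. The output directory and classification file name are also assumptions based on the example layout:

# Write a minimal custom rules file: FSM must not look intra-primed;
# NIC passes if all junctions are canonical OR min_cov is at least 10 (two OR-ed rules)
cat > my_rules.json <<'EOF'
{
    "full-splice_match": [
        {
            "perc_A_downstream_TTS": [0, 59]
        }
    ],
    "novel_in_catalog": [
        {
            "all_canonical": "canonical"
        },
        {
            "min_cov": 10
        }
    ]
}
EOF

python sqanti3_filter.py rules \
    --sqanti_class example/SQANTI3_QC_output/UHR_chr22_classification.txt \
    -j my_rules.json -d example/custom_rules_output -o UHR_chr22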

In summary, SQANTI3 provides a flexible way for users to filter their transcriptomes based on specific criteria to curate high-quality transcript sets. This is facilitated through easy-to-define JSON-based rule sets.

4.2 SQANTI3 Machine Learning Filter (Video link for this part)

SQANTI3 has enhanced the machine learning-based filter (ML filter) inherited from its predecessors. The filter, built on a random forest classifier, discerns between true isoforms and artifacts. Training relies on SQANTI3's QC attributes and on the selection of true positive (TP) and true negative (TN) isoform sets. Users can define these sets themselves or let the software pick them automatically based on SQANTI3's classifications: by default, the filter takes NNC non-canonical isoforms as TN and the reference match (RM) subcategory as TP. The two sets are balanced in size before training, and by default a transcript must reach a probability of ≥0.7 of being a genuine isoform to be kept; users can adjust this threshold and specify which isoforms are included or excluded. The filter now benefits from new validation metrics and additional data such as CAGE-seq peaks, the short read-based TSS ratio metric, polyA motif information, and Quant-seq peaks, allowing better artifact detection. The columns used for model training can be customized to avoid overfitting. The classifier's output extends the original *_classification.txt file into a *_ML_result_classification.txt file.

4.2.1 Example command:

python sqanti3_filter.py ml --sqanti_class path/to/classification.txt
Required arguments:
  • --sqanti_class: SQANTI3 QC classification file.
Optional arguments:
  • --isoAnnotGFF3: Specifies the isoAnnotLite GFF3 file to be filtered.
  • --isoforms: Designates the fasta/fastq isoform file for filtering.
  • --gtf: Indicates the GTF file for filtering.
  • --sam: Signifies the SAM alignment of the input fasta/fastq.
  • --faa: Represents the ORF prediction faa file for SQANTI3 filtering.
  • -o, --output: Prefix for output files.
  • -d, --dir: Output directory (default is the directory where the script was run).
  • -e, --filter_mono_exonic: When set, all mono-exonic transcripts are automatically filtered as artifacts.
  • -v, --version: Shows program version number.
  • --skip_report: Skips creating a filtering report.
  • -t, --percent_training: Data proportion for training (default is 0.8 or 80%).
  • -p, --TP: Path to file listing TP transcripts.
  • -n, --TN: Path to file listing TN transcripts.
  • -j, --threshold: Probability threshold to classify as positive isoforms (default is 0.7).
  • -f, --force_fsm_in: Forces the inclusion of FSM transcripts regardless of the ML result.
  • --intermediate_files: Outputs intermediate files from ML filter.
  • -r, --remove_columns: Provides path to file with names of columns to be excluded during training.
  • -z, --max_class_size: Max number of isoforms for TP and TN sets (default is 3000).
  • -i, --intrapriming: Sets adenine percentage to flag an isoform as intra-priming (default is 60).

For an in-depth understanding of each parameter, you can delve into the detailed descriptions in the filter wiki page. The ML filter allows users to define TP and TN sets or utilize the built-in sets. After defining the sets, you'll split them for model training/testing. The ML filter will utilize the probability threshold to label transcripts as Isoforms or Artifacts.

Additionally, certain columns from the classification table are omitted before running the SQANTI3 ML filter. Users have the liberty to exclude more columns if needed. There's an intra-priming filter included in the script and options to enforce the removal or retention of specific isoform groups.
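
Putting several of these options together, a fuller invocation on the tutorial data might look like the sketch below. The classification file name and output directory are assumptions based on the example layout, all flags are taken from the list above, and the numeric values shown are simply the documented defaults made explicit:

python sqanti3_filter.py ml \
    --sqanti_class example/SQANTI3_QC_output/UHR_chr22_classification.txt \
    -d example/ml_filter_output -o UHR_chr22 \
    -t 0.8 -j 0.7 -z 3000 \
    --intermediate_files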

4.2.4 Understand how the machine learning filter works

In machine learning, especially for supervised learning tasks, training data is pivotal. The model's ability to accurately predict or classify unseen data depends largely on the quality and quantity of the training data. In the context of the SQANTI3's machine learning filter, let's delve deeper into the nuances of training data and the model.

1. True Positive (TP) and True Negative (TN) Sets

Selection Criteria for TP and TN sets:

  • True Positive (TP): Reliable isoforms from which the model learns the features of a genuine isoform. The built-in selection includes Reference Match (RM) isoforms, and if there are insufficient RM isoforms, Full-Splice Match (FSM) isoforms are taken as TP.
  • True Negative (TN): These are isoforms considered as low-quality or artifacts. The built-in selection encompasses Novel Not in Catalog (NNC) isoforms with at least one non-canonical junction. If this group is small, all NNC isoforms are used as TN.

Users can supply their own TP and TN sets, tailoring the training data to their specific requirements. If the user has prior knowledge about certain transcripts being genuine isoforms or artifacts, this functionality becomes valuable.

SQANTI3 looks for a balance in the training sets. If there's a significant disparity between the sizes of the TP and TN sets, the larger set is downsized through random sampling to match the smaller set size. This ensures balanced training and avoids bias towards the over-represented class.
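
If you want to supply your own sets via -p/--TP and -n/--TN, each file is expected to be a plain list of isoform IDs, one per line. Below is a hedged sketch for deriving a TP list from the QC classification file; the subcategory column name and the reference_match value are assumptions, so adjust them to your file's header (a TN list can be built analogously):

# Hypothetical TP list: IDs of reference-match isoforms (first column = isoform ID)
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="subcategory") c=i; next} $c=="reference_match" {print $1}' \
    example/SQANTI3_QC_output/UHR_chr22_classification.txt > my_TP_list.txt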

2. Partitioning the Training Data

Before training the model, it’s important to divide the dataset into training and testing sets. Typically, 80% of the data is used for training, while the remaining 20% is set aside for testing. This approach allows the model to learn from the majority of the data while still being evaluated on data it hasn’t seen before, giving a clearer picture of how well it generalizes. To further strengthen the model’s reliability, SQANTI3 uses a 10-fold cross-validation technique during training. This means the training data is split into 10 equal parts—nine are used to train the model, and the remaining one is used for validation. This process repeats 10 times, cycling through each part. The goal is to reduce overfitting and ensure the model performs consistently across different subsets of data.

3. Random Forest Classifier

The classification itself is handled by a Random Forest model, which is a popular ensemble learning method. It works by combining the predictions of multiple decision trees, offering a balance between flexibility and accuracy. Random Forests are particularly good at managing complex datasets with many features. In SQANTI3, the model is trained using the caret package in R, and it draws on a variety of transcript-derived features like transcript length, number of exons, and junction patterns. After training, the model doesn’t just spit out a yes or no—it assigns each transcript a probability score indicating how likely it is to be a true isoform. By default, transcripts scoring above 0.7 are considered genuine.

To evaluate how well the model performs, the reserved 20% of testing data comes back into play. This separate data helps confirm whether the model's predictions hold up in practice. Once the model is trained and tested, it’s saved as an .RData file. If you run the filtering step again in the same directory, SQANTI3 will reuse this saved model to skip retraining. That’s useful when you want to apply a trained model to new data. However, if you want to train a fresh model instead, you'll need to move or delete the existing one first.

4. Feature Importance and Exclusion

Not all features in a dataset are equally useful—some may have little impact on the model’s performance, while others might even introduce unwanted bias. To address this, SQANTI3 offers the flexibility to exclude specific features or columns from the training process. This allows users to fine-tune the model by focusing only on the most informative data, helping to reduce noise and improve overall accuracy.

By leaving out certain columns, users can also guard against overfitting. If a model becomes too dependent on features that are only relevant within the training set, it may struggle to generalize to new data. Excluding such features ensures the model remains robust and applicable beyond the specific dataset it was trained on.

Note: the random forest classifier has specific scenarios in which it might not run, so be mindful of exceptions.

5. Executing and Understanding SQANTI3 Rescue (Video link for this part)

The SQANTI3 rescue module is designed to prevent the loss of transcripts from long-read data that could not be accurately processed. It aims to identify valid transcript models for transcripts initially discarded as potential artifacts by the SQANTI3 filter module. Full documentation can be found in the rescue wiki page.

SQANTI3 rescue offers two main modes of operation: automatic and full. The automatic mode, which is the default, focuses on recovering high-confidence reference isoforms by identifying reference transcripts for which all corresponding Full Splice Match (FSM) isoforms were removed during the filtering stage, under the assumption that their junction structures are likely correct. On the other hand, the full mode extends the rescue process to consider ISM, NIC, and NNC artifacts as rescue candidates. This mode involves a more comprehensive strategy, including mapping these candidate artifacts to potential target transcripts from both the reference and long-read datasets using minimap2, followed by applying the SQANTI3 filter to the reference targets and a final selection of rescued transcripts based on the mapping results and filter outcomes.

5.1 Automatic rescue:

The automatic rescue process in SQANTI3 identifies and restores isoforms that were classified as artifacts during filtering but are likely to be biologically relevant. The process focuses on rescuing the reference transcripts for which all of the associated full-splice match (FSM) isoforms have been removed. Mono-exonic FSM isoforms are only considered if the user specifies the --rescue_mono_exonic option. If a reference transcript has an associated FSM isoform that was classified as an artifact, the reference transcript is rescued, provided it is not already present in the final set. This approach ensures that potentially important transcripts, particularly those matching reference annotations, are not lost due to stringent filtering criteria.

sqanti3_rescue.py rules --filter_class example/rules_filter_output/UHR_chr22_RulesFilter_result_classification.txt \
                        --refGTF data/reference/gencode.v38.basic_chr22.gtf \
                        --refFasta data/reference/GRCh38.p13_chr22.fasta \
                        --mode automatic \
                        --dir example/rescue_automatic --output UHR_chr22

This will generate a list with the reference transcripts that have been rescued (if any) and the updated GTF file with the final rescued transcriptome. This can be further expanded by using the --mode full option, which will include all the ISM, NIC and NNC artifacts as potential rescue candidates.

Note: You will have to specify either rules or ml for the rescue mode, based on the filter strategy used and the corresponding classification file.

5.2 Full rescue:

The first step in the full rescue is to select the candidate isoforms and the target reference transcripts to rescue.

  • Candidate isoforms: These will be ISM, NIC and NNC isoforms that were classified as artifacts during the filtering step.
  • Target reference transcripts: In this group, we will include all of the isoforms that belong to the associated genes of the candidate isoforms. This includes the reference transcripts from those genes and the long-read isoforms that were classified as FSM.

Once the candidate isoforms and the target reference transcripts are selected, we will map the candidate isoforms to the target reference transcripts using minimap2. This mapping will be used to identify the best matching target transcript for each candidate isoform.

Before doing the final rescue, we need to treat the reference annotation as if it were a long-read transcriptome: the reference itself must be run through SQANTI3 QC with the same parameters and orthogonal data that were used for the long-read dataset, as sketched below. Once that is done, run rescue in either rules or ML mode, depending on the filter strategy you used.
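
For the tutorial data, that reference QC run could look like the sketch below. This is a hedged example: the reference annotation is passed both as the query and as the reference, the orthogonal data files are the same ones used in section 3.1, and the output prefix and directory are chosen so that the resulting classification file matches the --refClassif path used in the rescue commands that follow:

python sqanti3_qc.py --isoforms data/reference/gencode.v38.basic_chr22.gtf \
                     --refGTF data/reference/gencode.v38.basic_chr22.gtf \
                     --refFasta data/reference/GRCh38.p13_chr22.fasta \
                     --CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed \
                     --polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt \
                     --short_reads example/UHR_chr22_short_reads.fofn \
                     -o gencode.v38.basic_chr22 -d data/reference --cpus 4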

Rules-based rescue

If the rules strategy was used, reuse the same JSON file that was used to filter the long-read transcriptome.

sqanti3_rescue.py rules \
    --filter_class example/rules_filter_output/UHR_chr22_RulesFilter_result_classification.txt \
    --refGTF data/reference/gencode.v38.basic_chr22.gtf \
    --refFasta data/reference/GRCh38.p13_chr22.fasta \
    --rescue_isoforms example/SQANTI3_QC_output/UHR_chr22_corrected.fasta \
    --rescue_gtf example/rules_filter_output/UHR_chr22.filtered.gtf \
    --refClassif data/reference/gencode.v38.basic_chr22_classification.txt \
    --mode full \
    --json_filter src/utilities/filter/filter_default.json \
    --dir example/rescue_rules --output UHR_chr22
Machine learning-based rescue

If the ML strategy was used, you will have to directly provide SQANTI3 with the random forest model that was trained during the filtering step. The rest of the parameters are the same as for the rules-based rescue.

sqanti3_rescue.py ml \
    --filter_class example/ml_filter_output/UHR_chr22_ML_result_classification.txt \
    --refGTF data/reference/gencode.v38.basic_chr22.gtf \
    --refFasta data/reference/GRCh38.p13_chr22.fasta \
    --rescue_isoforms example/SQANTI3_QC_output/UHR_chr22_corrected.fasta \
    --rescue_gtf example/ml_filter_output/UHR_chr22.filtered.gtf \
    --refClassif data/reference/gencode.v38.basic_chr22_classification.txt \
    --mode full \
    --random_forest example/ml_filter_output/randomforest.RData \
    --dir example/rescue_ml --output UHR_chr22

5.2.2 Rescued isoforms selection

Once the candidate isoforms have been mapped to the targets and the reference annotation has been filtered in the same way as the long-read dataset, the final step of the rescue is to select the transcripts that will be rescued. Targets are reintroduced into the final transcriptome if they pass the following steps:

  1. Pass the applied SQANTI3 filter criteria (independently of being a reference transcript/gene or a long-read isoform).
  2. If a candidate has two possible targets, one being a long read-defined isoform already retained in the transcriptome and the other a reference transcript, the long-read isoform takes precedence, since the best matching transcript is then already part of the transcriptome and the artifact is considered represented.
  3. The target transcript will be rescued if it does not introduce redundancy into the final transcriptome. This means that if the target transcript is already present in the final transcriptome, it will not be rescued again.

5.3 The Main Principles of SQANTI3 rescue:

  1. Consistent Quality: Rescued transcript models should satisfy the quality control (QC) standards set during the filtering phase.
  2. Non-redundancy: If the replacement transcript (identified for an artifact) already exists in the filtered transcriptome, it isn't added again.

5.4 Understand the Process of SQANTI3 rescue:

  1. Automatic Rescue:

    • Targets FSM (Full Splice Match) artifacts arising when there's insufficient confidence in validating Transcription Start Sites (TSS) or Transcription Termination Sites (TTS).
    • If a reference transcript relates to a removed FSM, it's automatically added to the transcriptome. However, if multiple FSMs point to the same reference but have varied TSS, the reference is added just once.
    • Artifacts from ISM, NIC, and NNC categories are seen as potential rescue candidates. ISM artifacts are considered only if they don't have an associated FSM with the same reference transcript.
  2. Rescue Target Identification:

    • Defines a group of "rescue targets", i.e., potential replacements for the rescue candidates. This encompasses both long read-defined and reference isoforms belonging to genes that have at least one rescue candidate.
    • Candidates are mapped to the targets in a "rescue-by-mapping" process. The tool used here is minimap2, set for long-read alignment.
  3. Quality Control:

    • Ensures the reference transcriptome targets used in the rescue adhere to the same QC standards as the long-read targets.
    • The SQANTI3 QC and Filter modules (machine learning or rules) are applied on the reference transcriptome, ensuring the same data and quality standards as were used for the long-read transcript models.
  4. Final Selection:

    • Filters out mapping hits if the rescue target didn't pass the established filters.
    • If multiple valid mapping hits exist for a candidate, the one with the highest ML probability (whether long read-defined or reference) is chosen. In rules mode, if more than one hit passes the filter, all of those targets are considered.
    • If the best matching target is a long read-defined isoform already in the transcriptome, nothing changes since the artifact is already represented.
    • The remaining reference targets are checked for redundancy and added only if they aren't present in the already curated transcriptome.

In essence, the SQANTI3 rescue module seeks to ensure no valuable transcript data is lost while maintaining stringent quality and non-redundancy principles.
