Running SQANTI3 rescue - ConesaLab/SQANTI3 GitHub Wiki
As of SQANTI3 v5.1, a new module has been added to the SQANTI3 workflow for transcriptome characterization and quality control: SQANTI3 rescue.
The SQANTI3 rescue algorithm is designed to be run after transcriptome filtering and uses the long read-based evidence provided by discarded isoforms (i.e. artifacts) to recover transcripts in the associated reference transcriptome. The idea behind this strategy is to avoid losing transcripts/genes that are detected as expressed by long read sequencing, but whose start/end/junctions could not be confidently validated using orthogonal data, resulting in the removal of those genes/transcripts from the transriptome. More details about this can be found in the Motivation section below.
In particular, during the rescue, SQANTI3 will try to confidently assign each discarded artifact to the best matching reference transcript. As a result, SQANTI3 rescue will generate an expanded transcriptome GTF including a set of reference transcripts as well as the long read-defined isoforms that passed the filter. Optionally, requantification can be performed to reassign expression values to the rescued isoforms.

The SQANTI3 rescue algorithm consists in the following steps:
As explained above, the rescue strategy in SQ3 was conceived to recover transcriptome diversity lost during filtering. This, among other things, means verifying that mone of the reference-supported junction chains that were initially detected by long-reads are lost due to stringent artifact removal.
To achieve this during automatic rescue, all reference transcripts that were represented by at least one FSM in the original post-QC transcriptome are first retrieved -note that this information is available in the associated_transcript
column of the *_classification.txt
file. Then, those reference transcripts for which all FSM representatives were removed by the filter are rescued.
The previous analytic decision is justified because, in practice, any case where all FSMs with the same associated_transcript
are removed can be interpreted as follows: 1) the TSS and/or TTS of the long read-defined transcript is different from that of the matching reference transcript, however, it could not be validated by SQ3 QC-supplied orthogonal data; and 2) the junctions are identical to those found in the reference, which can be interpreted as evidence that this isoform is real. As a result, SQ3 will not rescue any of the discarded FSMs, but the associated_transcript
from the reference.
By default, SQANTI3 Rescue runs under --mode automatic
. As a result, this is the only rescue step that SQANTI3 Rescue performs. The complete algorithm can be run by the --mode full
argument.

In spite of its potential to achieve the goals that we set for the rescue, the previous strategy does not consider ISM, NIC or NNC artifacts. These will be included in the rescue candidate group, i.e. transcripts classified as artifacts for which SQANTI3 will try to find a matching reference transcript to include in the final, curated transcriptome.
-
For ISM artifact transcripts, there are two possible situations:
-
There will be cases where the same discarded reference transcript is supported by FSM and ISM. ISM artifacts with an FSM artifact counterpart will therefore be "collapsed" into the rescued reference transcript during the automatic rescue step.
-
Conversely, there will be
associated_transcript
references that are only supported by one or more ISM. Those ISM artifacts that constituted evidence of a non-FSM supported reference will therefore be included in the rescue candidate list.
-
-
For novel transcripts from the NIC and NNC categories, since there is no associated transcript information, all transcripts classified as artifacts will be included in the rescue candidate list.
As a result, we consider all reference or long read-defined transcripts from genes that have at least one rescue candidate to be rescue targets.
SQ3 rescue next tries to find matches between each rescue target and its same-gene candidates based on sequence similarity. To achieve this, we perform an internal mapping step using minimap2. In it, rescue candidates are considered to be "reads" and rescue targets are used as a "reference genome" in which each transcript sequence constitutes a different "chromosome".
To map candidates, we use the map-hifi
option in minimap2 and the -a -x
parameters:
minimap2 --secondary=yes -ax map-hifi rescue_targets.fasta rescue_candidates.fasta > mapped_rescue.sam
Finally, all candidate-target pairs obtained during mapping -referred to as mapping hits- are obtained from the output SAM file regardless of whether they are primary or secondary alignments.

Validation of transcripts using orthogonal sources of data is an important part of the SQANTI3 philosophy. In consequence, the rescue strategy in SQ3 includes a validation step for all reference rescue targets before considering them for inclusion in the transcriptome, since no QC information is available for them (in contrast to long read-defined targets).
This requires users to run SQANTI3 quality control on the reference transcriptome and supply the output *_classification.txt
file to SQANTI3 rescue using the --refClassif
(-k
) flag. This must be done using the same orthogonal data files that were used when running SQANTI3 QC for the long read-transcriptome, since the rescue is based on the assumption that the same evidence is required to validate all rescue targets.
Using the supplied reference classification file, SQ3 rescue will next apply SQ3 filter to the reference transcriptome. The filter to be applied will be specified by the rules
or ml
flags used when running the rescue. This means that, if you run SQ3 machine learning-based filter, you should also run the rescue using the ml
option (and the same is true for the rules filter).

In the final rescue step, hereby referred to as rescue-by-mapping, mapping hits (i.e. candidate-target pairs obtained during mapping) and reference transcriptome filter results are combined to generate a final list of reference transcripts to be rescued.
This part of the rescue can be divided into the following tasks/criteria:
-
SQ3 filtering of targets: first, candidate-target pairs are removed if the rescue target did not pass the filter (ML or rules). If the user is working with the rules filter, this will mean that the reference (or long-read) target transcript did not pass the rules defined in the JSON file. If the user is working with the ML filter, this will mean that the target transcript did not pass the specified ML probability threshold. In the case of the ML filter, however, only the target transcript(s) with maximum ML probability are selected to continue in the rescue.
-
Select reference targets: as explained above, both long read-defined and reference target transcripts are considering during mapping. As a result, rescue candidates (i.e. artifacts) can match to either of these transcripts. From this step onwards, however, only targets selected from the reference transcriptome are considered. This is done to avoid introducing redundancy in the final transcriptome: if a rescue candidate's perfect match was another long read-defined counterpart from the same gene, there is no need to include another transcript as a rescued representative, since the best matching transcript is already part of the transcriptome.
-
Remove redundant reference transcripts: during rescue, there is a chance that some of the reference transcripts that we wish to include in our transcriptome is already represented by an FSM from the same gene (or that it has been already retrieved during automatic rescue). In this situation, target reference transcripts that are already represented by a long read-defined isoform are not to be included in the transcriptome in order to avoid increasing redundancy. In other words -and similarly to what was previously stated-, these are cases in which the artifact's (i.e. rescue candidate) best matching transcript is already represented by an isoform in the transcriptome.
After performing this last filter of the rescue target list, SQ3 rescue outputs a list of rescued reference transcripts, which are then added to the long-read transcriptome GTF.

The requantification is applied to each transcript following one of the four scenarios:
-
A transcript was not impacted by either filtering or the rescue step; the expression level of such a transcript is intact.
-
A transcript was flagged by the filter as an artifact and was not rescued; the expression level is ignored.
-
A transcript was initially discarded and later rescued by the reference transcript; the initial expression level is transferred to the reference transcript.
-
One of the qualifying transcripts did not pass the redundancy test and was not included in the final rescue isoforms list, meaning that the transcript is already represented by a transcript model that was not affected by the filtering step. The expression levels from artifacts are summed up with the expression levels of the unaffected transcript.
Similarly to the SQANTI3 filter, the SQANTI3 rescue is designed as a
dual implementation, depending on whether the rules or the machine learning filter was previously run. Therefore, the sqanti3_rescue.py
script requires a flag to be provided to activate either the ml
or rules
specific rescue.
usage: sqanti3_rescue.py [-h] {ml,rules} ...
Rescue artifacts discarded by the SQANTI3 filter, i.e. find closest match for
the artifacts in the reference transcriptome and add them to the
transcriptome.
positional arguments:
{ml,rules}
optional arguments:
-h, --help show this help message and exit
-
Long read-defined isoforms: should be the FASTA file generated by SQANTI3 QC (
*_corrected.fasta
). The isoform sequences will be used to extract rescue candidates and targets for mapping, and must be supplied via the--rescue_isoforms
argument. -
Long read-defined transcriptome annotation (filtered): should be the GTF file output by SQANTI3 filter (
*.filtered.gtf
). Rescued transcripts will be appended to this GTF file to generate the final curated transcriptome. This file must be supplied via the--rescue_gtf
argument. -
Reference transcriptome annotation: reference GTF used to run SQANTI3 QC. This file will be used to extract rescue targets for mapping, and must be supplied via the
--refGTF
(or-g
) argument. -
Reference transcriptome classification file generated after running SQANTI3 QC on the reference transcriptome, which must be done previously to running the rescue and using the same orthogonal data as for long read-defined transcriptome QC. This file must be supplied via the
--refClassif
(or-k
) argument and will be used to evaluate reference rescue target support (see details above). -
Counts file containing expression values of transcript models before SQANTI3 filtering and rescue. This file is required if the
--requant
(-q
) module is used. It must include two columns: the first for the isoform ID and the second for the corresponding expression value. Column headers can vary. Provide this file using the--counts
(or-c
) argument; it will be used to re-evaluate the expression values of the filtered and rescued transcript models.
Additionally, the following parameters can be set to modify the behavior of the rescue algorithm:
-
Rescue mode: the
--mode
argument can be used to extend the rescue algorithm to non-FSM artifacts by supplying--mode full
. By default (--mode automatic
), only the automatic rescue step is performed, enabling the rescue of high-confidence reference isoforms. -
Define rescue for mono-exon transcripts: the
-e
flag can be used to determine whether (and how) to perform the rescue on mono-exonic transcripts. The following options can be supplied:-
all
(default): all mono-exon transcripts classified as artifacts will be considered as rescue candidates, regardless of the category that they are classified into. -
fsm
: mono-exonic transcripts will only be considered for the rescue if they belong to the full-splice match (FSM) category. In this case, mono-exons will be rescued via the automatic rescue strategy. -
none
: do not consider mono-exonic transcripts for the rescue (all mono-exons classified as artifacts will be discarded and no reference representatives included in the final transcriptome).
-
-
Output directory and outfile prefix: the output directory can be set via the
-d
flag (default: current directory). Output files will be named using the prefix provided using the-o
argument (default: SQANTI3).
-
Rules JSON file: using the
-j
flag, the user must provide the JSON file used for running SQANTI3 rules filter on the long read-defined transcriptome. This same set of rules will be applied to the reference transcriptome to evaluate the reliability of reference rescue targets. Note that, to achieve this, SQANTI3 rescue requires the classification file output after running SQANTI3 QC on the reference to be provided via the--refClassif
argument.
All in all, these are the arguments accepted by sqanti3_rescue.py rules
:
usage: sqanti3_rescue.py rules [-h] --filter_class FILTER_CLASS --refGTF REFGTF --refFasta REFFASTA [--rescue_isoforms RESCUE_ISOFORMS] [--rescue_gtf RESCUE_GTF] [-k REFCLASSIF] [-e {all,fsm,none}]
[--mode {automatic,full}] [-o OUTPUT] [-d DIR] [-c CPUS] [-v] [-l {ERROR,WARNING,INFO,DEBUG}] [-j JSON_FILTER] [-q] [-c EXPRESSION_VALUES]
Arguments explanation
Rescue for rules-filtered transcriptomes.
options:
-h, --help show this help message and exit
Required arguments:
--filter_class FILTER_CLASS
SQANTI filter (ML or rules) output classification file.
--refGTF REFGTF Full path to reference transcriptome GTF used when running SQANTI3 QC.
--refFasta REFFASTA Full path to reference genome FASTA used when running SQANTI3 QC.
Input options:
--rescue_isoforms RESCUE_ISOFORMS
FASTA file output by SQANTI3 QC (*_corrected.fasta), i.e. the full long read transcriptome.
--rescue_gtf RESCUE_GTF
GTF file output by SQANTI3 filter (*.filtered.gtf).
-k REFCLASSIF, --refClassif REFCLASSIF
Full path to the classification file obtained when running SQANTI3 QC on the reference transcriptome.
Mandatory when running the rescue on full mode
-q, --requant
Run requantification to reassign expression values to the rescued transcripts.
-c, --counts
Full path to the file containing transcript expression values before SQANTI3 filtering and rescue.
Customization options:
-e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
Whether or not to include mono-exonic artifacts in the rescue.
Default: all
--mode {automatic,full}
If 'automatic' (default), only automatic rescue of FSM artifacts will be performed.
If 'full', rescue will include mapping of ISM, NNC and NIC artifacts to find potential replacement isoforms.
Output options:
-o OUTPUT, --output OUTPUT
Prefix for output files.
-d DIR, --dir DIR Directory for output files. Default: Directory where the script was run.
Extra options:
-c CPUS, --cpus CPUS Number of CPUs to use. Default: 4
-v, --version Display program version number.
-l {ERROR,WARNING,INFO,DEBUG}, --log_level {ERROR,WARNING,INFO,DEBUG}
Set the logging level INFO
Rules specific options:
-j JSON_FILTER, --json_filter JSON_FILTER
Full path to the JSON file including the rules used when running the SQANTI3 rules filter.
Default: /home/pabloati/Programs/sqanti3/src/utilities/filter/filter_default.json
In addition to the common arguments, the machine learning rescue requires the following **specific files**:
-
Pre-trained random forest classifier: using the
-r
flag, users must provide therandomforest.RData
object that was obtained when running the machine learning filter on the long read-defined transcriptome. This classifier will be used to run the ML filter on reference transcriptome isoforms, i.e. obtain an ML probability value for reference rescue targets and evaluate their likelihood to be a true isoform based on orthogonal data supplied to SQANTI3 QC. Note that, to achieve this, SQANTI3 rescue requires the classification file output after running SQANTI3 QC on the reference to be provided via the--refClassif
argument. -
ML probability threshold: minimum probability value that is required in order to consider a transcript to be a true isoform. It should be set using the
-j
flag (default: 0.7). We recommend setting the same threshold that was used when running SQANTI3 filter.
All in all, these are the arguments accepted by sqanti3_rescue.py ml
:
usage: sqanti3_rescue.py ml [-h] --filter_class FILTER_CLASS --refGTF REFGTF --refFasta REFFASTA [--rescue_isoforms RESCUE_ISOFORMS] [--rescue_gtf RESCUE_GTF] [-k REFCLASSIF] [-e {all,fsm,none}]
[--mode {automatic,full}] [-o OUTPUT] [-d DIR] [-c CPUS] [-v] [-l {ERROR,WARNING,INFO,DEBUG}] [-r RANDOM_FOREST] [-j THRESHOLD]
Arguments explanation
Rescue for ML-filtered transcriptomes.
options:
-h, --help show this help message and exit
Required arguments:
--filter_class FILTER_CLASS
SQANTI filter (ML or rules) output classification file.
--refGTF REFGTF Full path to reference transcriptome GTF used when running SQANTI3 QC.
--refFasta REFFASTA Full path to reference genome FASTA used when running SQANTI3 QC.
Input options:
--rescue_isoforms RESCUE_ISOFORMS
FASTA file output by SQANTI3 QC (*_corrected.fasta), i.e. the full long read transcriptome.
--rescue_gtf RESCUE_GTF
GTF file output by SQANTI3 filter (*.filtered.gtf).
-k REFCLASSIF, --refClassif REFCLASSIF
Full path to the classification file obtained when running SQANTI3 QC on the reference transcriptome.
Mandatory when running the rescue on full mode
Customization options:
-e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
Whether or not to include mono-exonic artifacts in the rescue.
Default: all
--mode {automatic,full}
If 'automatic' (default), only automatic rescue of FSM artifacts will be performed.
If 'full', rescue will include mapping of ISM, NNC and NIC artifacts to find potential replacement isoforms.
Output options:
-o OUTPUT, --output OUTPUT
Prefix for output files.
-d DIR, --dir DIR Directory for output files. Default: Directory where the script was run.
Extra options:
-c CPUS, --cpus CPUS Number of CPUs to use. Default: 4
-v, --version Display program version number.
-l {ERROR,WARNING,INFO,DEBUG}, --log_level {ERROR,WARNING,INFO,DEBUG}
Set the logging level INFO
Machine Learning specific options:
-r RANDOM_FOREST, --random_forest RANDOM_FOREST
Full path to the randomforest.RData object obtained when running the SQANTI3 ML filter.
-j THRESHOLD, --threshold THRESHOLD
Machine learning probability threshold to filter elegible rescue targets (mapping hits).
Default: 0.7
The final result of SQANTI3 rescue is a **transcriptome GTF file** - named `prefix_rescued.gtf` - including all the transcripts that were classified as isoforms by either the rules or the ML filter (already included in the supplied `prefix.filtered.gtf` file) and the rescued transcript isoforms that were selected from the reference transcriptome (see [rescue strategy](https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-rescue#strategy) section above).
Other additional output files that are generated in the rescue process will be explained below.
- **Automatic rescue result** (both `--mode` options): a single-column file named `prefix_automatic_rescued_list.tsv` containing the IDs of reference transcripts that were selected during automatic rescue.-
Rescue candidate files (for
--mode full
): Briefly, rescue candidates are a group artifact transcripts from the ISM, NIC and NNC categories (which did not undergo automatic rescue) for which a reference transcript match needs to be found. Two rescue candidate files are generated:- A rescue candidate ID list, named
prefix_rescue_candidates.tsv
. - A rescue candidate FASTA file named
prefix_rescue_candidates.fasta
containing the sequences of rescue candidates for mapping.
- A rescue candidate ID list, named
-
Rescue target files (for
--mode full
). As explained above, rescue targets are those reference and long read-defined transcripts for which there is a same-gene rescue candidate counterpart. During the rescue, candidates are mapped to targets using minimap2. Similarly to rescue candidates, two rescue target files are generated:- ID list:
prefix_rescue_targets.tsv
. - FASTA file:
prefix_rescue_targets.fasta
.
- ID list:
-
Mapping results (for
--mode full
): as a result of mapping with minimap2, two files are generated by SQ3 rescue:- Minimap2 SAM file containing the mapping results, named
prefix_mapped_rescue.sam
. - A three-column table in which the mapping results are summarized, named
prefix_rescue_mapping_hits.tsv
. This table has the following format:
- Minimap2 SAM file containing the mapping results, named
PB.1.3 0 PB.7707.5
PB.1.3 256 ENST00000557932.5
PB.1.3 256 ENST00000558784.5
PB.1.3 256 ENST00000442898.5
PB.1.3 256 ENST00000692602.1
PB.1.3 256 ENST00000686495.1
Here, the first column contains the rescue candidate ID, the second column includes the SAM flag indicating the type of alignment and the last column indicates the ID of the rescue target that the candidate was successfully mapped to.
- Rescue summary table:
For mode --automatic
(default), the prefix_rescue_table.tsv
file shows the artifacts considered for automatic rescue, and the reference transcript that they have been replaced with:
- Column 1 (
artifact
): artifact ID. Includes filtered FSM and ISM associated to the same references. - Column 2 (
rescued_transcript
): reference ID for the re-introduced ref. transcript. All filtered FSM and ISM associated to the same reference will be considered as supporting the same rescue event. - Column 3 (
structural_category
): indicates whetherartifacts
are FSM or ISM.
For mode --full
the prefix_rescue_table.tsv
file contains a table in which rescue outcome for all the target-candidate pairs included in the mapping hit table is detailed:
-
Column 1 (
rescue_candidate
): rescue candidate ID. -
Column 2 (
sam_flag
): SAM flag indicating the type of alignment. -
Column 3 (
mapping_hit
): ID of the rescue target that the candidate was successfully mapped to. -
Column 4 shows the filter result for the mapping hit (or rescue candidate), and may be named
hit_POS_MLprob
if theml
mode was used, orhit_filter_result
if therules
mode was used.- For ML, the
hit_POS_MLprob
column will contain the probability that the hit is a true isoform provided by the pre-trained random forest classifier. - For rules, the
hit_filter_result
column will read Isoform or Artifact depending on whether or not the hit passed the rules defined in the JSON file.
- For ML, the
-
Column 5 (
candidate_structural_category
): SQANTI3 category that the rescue candidate (i.e. the artifact that is being considered for rescue) belongs to. -
Column 6 (
rescue_result
): final rescue decision. Will berescued
ornot_rescued
depending on whether or not the isoform was finally included in the rescued transcriptome GTF -
Column 7 (
exclusion_reason
): ifrescue_result == not_rescued
, this column will include a flag explaining why the rescue target was not included in the rescue. On the contrary, ifrescue_result == rescued
, the column will containNA
. Possible exclusion reasons are:-
long_read_transcript
if the mapping hit is a long read-defined isoform, and is therefore already present in the transcriptome. -
reference_already_present
if the mapping hit is a reference transcript that is already represented by an isoform, be it an FSM that passed the filter or a transcript that was obtained during automatic rescue. - For ML rescue,
ML_probability
if the mapping hit did not pass the supplied ML probability threshold. Alternatively, for rules rescue, the flag will readartifact_by_rules
, meaning that the mapping hit did not pass the rules specified in the JSON file.
-
-
Column 8 (
best_match_for_candidate
): rescue candidate-level (i.e. transcript-level) summary of the rescue. For each rescue candidate ID, a flag is included summarizing all the decisions that were made regarding all potential rescue targets (mapping hits):-
reference_transcript
: in cases where one or more reference transcripts have been effectively rescued, or where the rescue target was already represented by another transcsript (i.e. thereference_already_present
case described above). -
long_read_transcript
: whenever no reference transcripts were validated for a given rescue candidate, but at least one good long read-defined match (i.e. passing ML probability/rules) was detected -even if not rescued. -
unknown
: this flag is assigned if none of the rescue targets selected for the artifact (i.e. the candidate) pass the ML probability threshold or the specified rules.
-
-
Column 9 (
best_match_id
): for each rescue candidate ID, this column will contain the transcript ID of the candidate's best match, as found by the rescue. This will be defined by the following rules:- Normally, the best match will be the rescue target with the maximum ML filter probability (above the specified threshold) or the one that passed the rules defined in the JSON file, depending on whether the rescue was run in the
ml
orrules
modes. - In case of ties (i.e. multiple targets have scored the maximum ML probability or passed the filter rules), the primary alignment obtained during mapping of candidates vs targets will be selected, if present. If none of the ties include the primary alignment, all transcripts that passed the probability/rules criteria will be reported.
- Finally, in case of
best_match_for_candidate == unknown
, the best match ID will also beunknown
.
- Normally, the best match will be the rescue target with the maximum ML filter probability (above the specified threshold) or the one that passed the rules defined in the JSON file, depending on whether the rescue was run in the
An example of what this table would look like for one rescue candidate and all its assigned mapping hits/rescue targets (in this case for the ml
rescue mode) can be viewed here:
rescue_candidate sam_flag mapping_hit hit_POS_MLprob candidate_structural_category rescue_result exclusion_reason best_match_for_candidate best_match_id
PB.1.3 0 PB.7707.5 0.754 novel_not_in_catalog not_rescued long_read_transcript reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000557932.5 0.93 novel_not_in_catalog not_rescued reference_already_present reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000558784.5 0.098 novel_not_in_catalog not_rescued ML_probability reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000442898.5 0.18 novel_not_in_catalog not_rescued ML_probability reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000692602.1 0.998 novel_not_in_catalog not_rescued reference_already_present reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000686495.1 0.066 novel_not_in_catalog not_rescued ML_probability reference_transcript ENST00000692602.1
PB.1.4 0 PB.7707.5 0.754 novel_not_in_catalog not_rescued long_read_transcript reference_transcript ENST00000692602.1
-
Rescue inclusion list (
--mode full
): a single-column file containing the IDs of all rescued transcripts, namedprefix_rescue_inclusion-list.tsv
. In this list, both the IDs of rescued transcripts obtained via automatic rescue and via rescue-by-mapping are combined into a single inclusion list. This list is used to generate the final rescued transcriptome GTF, i.e.prefix_rescued.gtf
.
isoform POS_MLprob
DQ459430 0.25
DQ516784 0.262
DQ516752 0.27
DQ668364 0.262
DQ883670 0.804
EF011062 0.512
DQ875385 0.758