SQANTI‐Like Isoform Structure Comparisons - MethodsDev/LongReadAlignmentAssembler GitHub Wiki
LRAA includes a SQANTI-like utility for comparing the structures of aligned reads or reconstructed transcripts against a reference isoform annotation (for example, the GENCODE annotations for the human transcriptome).
Note that our SQANTI-like annotation system is provided as a convenience for LRAA users and is not meant as a substitute for SQANTI3. Users are encouraged to explore the more feature-rich SQANTI3 system as well.
Our SQANTI-like classifications are illustrated below.
The LRAA SQANTI-like categories are defined as follows:
Multi-exon reads
- Full Splice Match (FSM) : the full chain of introns spliced from the read matches the full chain of introns spliced in the reference transcript.
- Incomplete Splice Match (ISM): the chain of introns spliced from the read matches part of the full chain of introns spliced in the reference transcript, in the same consecutive order.
- Novel In Category (NIC) : not FSM or ISM but otherwise all splice sites in the read match reference splice sites.
- Novel Not In Category (NNIC) : some but not all splice sites evident from the read alignment match the reference splice sites (highlighted in the above image by '*')
- genic : multi-segment read alignment matches exons but no splice site matches the reference.
- intronic : multi-segment read alignment found entirely within a reference intronic region with no exon overlap.
- antisense : multi-segment read alignment overlaps a reference exonic region on the opposite strand
- intergenic : read alignment does not overlap the coordinate span of any reference gene.
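The splice-based categories above can be illustrated with a minimal Python sketch. This is not LRAA's implementation: the function names, the representation of an intron as a `(start, end)` coordinate pair, and the placeholder label `no_splice_site_match` are all assumptions made for illustration. The sketch also ignores strand and does not attempt the exon-overlap checks needed to separate genic, intronic, antisense, and intergenic reads.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: an intron is a (start, end) coordinate pair,
# and a transcript or read is an ordered list of introns (its intron chain).
Intron = Tuple[int, int]


def is_contiguous_subchain(read_chain: List[Intron], ref_chain: List[Intron]) -> bool:
    """True if read_chain occurs as a consecutive run inside ref_chain."""
    n, m = len(read_chain), len(ref_chain)
    return any(ref_chain[i:i + n] == read_chain for i in range(m - n + 1))


def classify_multi_exon(read_chain: Sequence[Intron],
                        ref_chains: Sequence[Sequence[Intron]]) -> str:
    """Assign a splice-based category to a multi-exon read (illustrative only)."""
    read = list(read_chain)
    refs = [list(ref) for ref in ref_chains]

    # FSM: the read's intron chain equals a full reference intron chain.
    if any(read == ref for ref in refs):
        return "FSM"

    # ISM: the read's chain is a shorter, consecutive part of a reference chain.
    if any(len(read) < len(ref) and is_contiguous_subchain(read, ref) for ref in refs):
        return "ISM"

    # Collect every reference splice site (simplification: intron starts and
    # ends are pooled into one set, and strand is ignored).
    ref_sites = {site for ref in refs for intron in ref for site in intron}
    read_sites = [site for intron in read for site in intron]
    n_matched = sum(site in ref_sites for site in read_sites)

    if n_matched == len(read_sites):
        return "NIC"   # all splice sites known, but the chain is new
    if n_matched > 0:
        return "NNIC"  # some but not all splice sites known
    # Placeholder: further exon-overlap logic would be needed here.
    return "no_splice_site_match"
```

For example, with reference chains `[(100, 200), (300, 400), (500, 600)]` and `[(100, 200), (300, 450)]`, a read with chain `[(300, 400), (500, 600)]` would be labeled ISM, while `[(100, 200), (300, 600)]` would be labeled NIC because it reuses known splice sites in a new combination.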
Single-exon reads
- se_FM : single-exon alignment matches a single-exon reference gene and overlaps by at least 90% of its length.
- se_IM : single-exon alignment matches a single-exon reference gene but fails to meet the se_FM criteria above.
- se_genic : single-exon alignment matches a multi-exon reference gene and overlaps both introns and exons.
- se_exonic : single-exon alignment matches a single exon of a multi-exon gene. These usually result from partial sequencing at 3' ends due to RNA degradation or other incompleteness issues, and they are frequent enough to deserve their own category.
- se_intronic : single-exon alignment matches only within the intron of a multi-exon reference transcript
- se_antisense : single-exon alignment overlaps an exon of any gene on the opposite strand and doesn't fit in other categories above.
- se_intergenic : single-exon alignment does not overlap any reference gene span on either strand.
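The 90% overlap rule that separates se_FM from se_IM can be sketched as follows. This is a minimal illustration, not LRAA's code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive (as in GTF), and the overlap fraction is assumed to be measured against the length of the reference exon, which the definitions above do not state explicitly.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end), assumed 1-based inclusive as in GTF


def overlap_length(a: Span, b: Span) -> int:
    """Number of bases shared by two inclusive coordinate spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def classify_vs_single_exon_gene(read: Span, ref_exon: Span,
                                 min_frac: float = 0.9) -> Optional[str]:
    """Return 'se_FM', 'se_IM', or None when the spans do not overlap.

    Assumption: the fraction is computed relative to the reference exon length.
    """
    shared = overlap_length(read, ref_exon)
    if shared == 0:
        return None  # no overlap; other categories would apply
    ref_len = ref_exon[1] - ref_exon[0] + 1
    return "se_FM" if shared / ref_len >= min_frac else "se_IM"
```

Under these assumptions, a read spanning 100-1000 against a single-exon gene at 101-1000 covers the whole reference exon and would be labeled se_FM, while a read spanning 500-1000 covers just over half of it and would be labeled se_IM.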
Running comparisons
To capture category assignments for individual reads, run the following:
LongReadAlignmentAssembler/util/SQANTI-like_cats_for_reads_or_isoforms.py \
--ref_gtf ${reference_gtf_file} \
--bam ${minimap2_aligned_reads.bam} \
--output_prefix sqanti-like
Outputs
Outputs include:
- ${output_prefix}.iso_cats.tsv.gz : each read name with its SQANTI-like category assignment.
- ${output_prefix}.iso_cats.summary_counts.tsv.gz : total read counts for each category.
- ${output_prefix}.iso_cats.bam : bam file with the annotations included as tags: CL:Z:${SQANTI3-like_category} CI:Z:${isoform_ids_comma_delim}
- ${output_prefix}.iso_cats.summary_counts.pdf : plots summarizing the counts according to multi-exon and single-exon read categories.
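The per-read table can also be tallied directly, for example to cross-check the summary counts. The sketch below is only a starting point: it assumes the file is tab-delimited with a header row, and the column name passed as `category_column` is a placeholder, since the actual column names are not described here. Check the header of your own output before using it.

```python
import csv
import gzip
from collections import Counter


def count_categories(tsv_gz_path: str, category_column: str) -> Counter:
    """Count rows per category in a gzipped, tab-delimited table.

    Assumptions: the file has a header row, and `category_column` is the
    name of the column holding the category label (hypothetical; inspect
    the real header to find it).
    """
    counts: Counter = Counter()
    with gzip.open(tsv_gz_path, "rt", newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        for row in reader:
            counts[row[category_column]] += 1
    return counts
```

A call such as `count_categories("sqanti-like.iso_cats.tsv.gz", "category")` would return a `Counter` keyed by category label, where `"category"` stands in for whatever the real column is called.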
An example ${output_prefix}.iso_cats.summary_counts.pdf is shown below: