SQANTI‐Like Isoform Structure Comparisons - MethodsDev/LongReadAlignmentAssembler GitHub Wiki

LRAA includes a SQANTI-like isoform structure comparative analysis utility for comparing aligned reads or reconstructed transcripts to a reference isoform structure database (such as using GENCODE annotations for the human transcriptome).

Note that our SQANTI-like annotation system is provided as a convenience for LRAA users and is not meant as a substitute for SQANTI3. Users are encouraged to explore the more feature-rich SQANTI3 system as well.

Our SQANTI-like classifications are illustrated below.

The LRAA SQANTI-like categories are defined as follows:

Multi-exon reads

Full Splice Match (FSM) : the full chain of introns spliced from the read matches the full chain of introns spliced in the reference transcript.
Incomplete Splice Match (ISM) : the chain of introns spliced from the read matches part of the chain of introns spliced in the reference transcript, in the same sequential order.
Novel In Category (NIC) : not FSM or ISM, but all splice sites in the read match reference splice sites.
Novel Not In Category (NNIC) : some but not all splice sites evident from the read alignment match reference splice sites (highlighted in the above image by '*').
genic : multi-segment read alignment overlaps exons, but no splice site matches the reference.
intronic : multi-segment read alignment found entirely within a reference intronic region, with no exon overlap.
antisense : multi-segment read alignment overlaps a reference exonic region on the opposite strand.
intergenic : read alignment does not overlap the coordinate span of any reference gene.
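
The splice-based categories above can be sketched in a few lines of Python. This is a simplified illustration and not LRAA's actual implementation. It assumes each intron is a hypothetical (donor, acceptor) coordinate pair on a single chromosome and strand, and it interprets ISM as a contiguous sub-chain of a reference intron chain. The function names are made up for this example. Reads with no matching splice site are returned as "other", since separating genic, intronic, antisense, and intergenic requires exon and strand information that is not modeled here.

```python
from typing import List, Sequence, Tuple

Intron = Tuple[int, int]  # (donor coordinate, acceptor coordinate); assumed representation


def is_contiguous_subchain(sub: Sequence[Intron], full: Sequence[Intron]) -> bool:
    """True if `sub` occurs as a consecutive run of introns within `full`."""
    n = len(sub)
    return any(list(full[i:i + n]) == list(sub) for i in range(len(full) - n + 1))


def classify_splice_pattern(read_chain: List[Intron],
                            ref_chains: List[List[Intron]]) -> str:
    """Assign FSM / ISM / NIC / NNIC to a multi-exon read, or 'other' otherwise."""
    # FSM: the read's intron chain is identical to a reference intron chain.
    if any(read_chain == ref for ref in ref_chains):
        return "FSM"
    # ISM: the read's intron chain is a consecutive part of a reference chain.
    if any(is_contiguous_subchain(read_chain, ref) for ref in ref_chains):
        return "ISM"
    # Collect every reference donor and acceptor site across all transcripts.
    ref_donors = {d for ref in ref_chains for d, _ in ref}
    ref_acceptors = {a for ref in ref_chains for _, a in ref}
    hits = [d in ref_donors for d, _ in read_chain]
    hits += [a in ref_acceptors for _, a in read_chain]
    if all(hits):
        return "NIC"   # all sites known, but combined in a new way
    if any(hits):
        return "NNIC"  # some sites known, some novel
    return "other"     # no splice site matches; needs exon/strand checks


# Two hypothetical reference transcripts from the same gene.
reference = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (300, 450)],
]
for chain in ([(100, 200), (300, 400), (500, 600)],
              [(300, 400), (500, 600)],
              [(100, 200), (500, 600)],
              [(100, 200), (300, 420)]):
    print(chain, classify_splice_pattern(chain, reference))
```

Donor and acceptor sites are kept in separate sets so that a donor coordinate cannot accidentally match a reference acceptor.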

Single-exon reads

se_FM : single-exon alignment matches a single-exon reference gene and overlaps by at least 90% of its length.
se_IM : single-exon alignment matches a single-exon reference gene but fails to meet the se_FM criteria above.
se_genic : single-exon alignment matches a multi-exon reference gene and overlaps both introns and exons.
se_exonic : single-exon alignment matches a single exon region of a multi-exon gene. These usually result from partial sequences matching at 3' ends due to RNA degradation or another incompleteness issue; there are enough of them that they deserve their own category.
se_intronic : single-exon alignment falls entirely within an intron of a multi-exon reference transcript.
se_antisense : single-exon alignment overlaps an exon of any gene on the opposite strand and doesn't fit any of the categories above.
se_intergenic : single-exon alignment does not overlap any reference gene span on either strand.
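
As a worked example of the 90% overlap rule separating se_FM from se_IM, here is a minimal sketch. It is not LRAA's code. It assumes 1-based inclusive coordinates (as in GTF) and assumes the overlap fraction is measured against the length of the reference gene; the text above does not state which length is used, so treat that choice as an assumption. The function names and the `min_frac` parameter are hypothetical.

```python
from typing import Optional


def overlap_fraction(aln_start: int, aln_end: int,
                     ref_start: int, ref_end: int) -> float:
    """Fraction of the reference span covered by the alignment (inclusive coords)."""
    overlap = min(aln_end, ref_end) - max(aln_start, ref_start) + 1
    ref_len = ref_end - ref_start + 1
    return max(overlap, 0) / ref_len


def classify_vs_single_exon_gene(aln_start: int, aln_end: int,
                                 ref_start: int, ref_end: int,
                                 min_frac: float = 0.9) -> Optional[str]:
    """Return 'se_FM' or 'se_IM' for an overlapping alignment, None if no overlap."""
    frac = overlap_fraction(aln_start, aln_end, ref_start, ref_end)
    if frac == 0:
        return None  # no overlap with this gene; other categories apply
    return "se_FM" if frac >= min_frac else "se_IM"


# Hypothetical single-exon reference gene spanning 1000..1999 (length 1000).
print(classify_vs_single_exon_gene(1050, 1999, 1000, 1999))  # covers 950 of 1000 bases
print(classify_vs_single_exon_gene(1500, 1999, 1000, 1999))  # covers 500 of 1000 bases
```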

Running comparisons

To capture category assignments for individual reads, run the following:

    LongReadAlignmentAssembler/util/SQANTI-like_cats_for_reads_or_isoforms.py \
           --ref_gtf ${reference_gtf_file} \
           --bam ${minimap2_aligned_reads_bam} \
           --output_prefix sqanti-like

Outputs

Outputs include:

${output_prefix}.iso_cats.tsv.gz : every read name and its SQANTI-like category assignment.
${output_prefix}.iso_cats.summary_counts.tsv.gz : total read counts for each category.
${output_prefix}.iso_cats.bam : the alignments with annotations added as bam tags: CL:Z:${SQANTI-like_category} CI:Z:${isoform_ids_comma_delim}
${output_prefix}.iso_cats.summary_counts.pdf : plots summarizing the counts for the multi-exon and single-exon read categories.
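
To show how the CL and CI tags in the annotated bam can be consumed downstream, here is a small sketch that parses them from SAM-formatted text lines (for example, the output of `samtools view`). It uses only the Python standard library. The helper name `parse_category_tags` and the example record, including its read name and isoform ids, are made up for illustration.

```python
from typing import List, Optional, Tuple


def parse_category_tags(sam_line: str) -> Tuple[Optional[str], List[str]]:
    """Extract the CL (category) and CI (isoform ids) tags from one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # SAM records have 11 mandatory fields; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    category = tags.get("CL")
    isoform_ids = tags["CI"].split(",") if tags.get("CI") else []
    return category, isoform_ids


# A made-up SAM record carrying the two tags described above.
example = "\t".join([
    "read_1", "0", "chr1", "1000", "60", "50M", "*", "0", "0",
    "A" * 50, "*", "NM:i:0", "CL:Z:FSM", "CI:Z:tx1,tx2",
])
print(parse_category_tags(example))
```

One way to use this is to pipe `samtools view ${output_prefix}.iso_cats.bam` into a script that calls the function on each line of standard input and tallies the categories.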

An example ${output_prefix}.iso_cats.summary_counts.pdf is shown below: