SQANTI‐Like Isoform Structure Comparisons - MethodsDev/LongReadAlignmentAssembler GitHub Wiki

LRAA includes a SQANTI-like utility for comparing the structures of aligned reads or reconstructed transcripts against a reference isoform structure database (for example, the GENCODE annotations for the human transcriptome).

Note that our SQANTI-like annotation system is provided as a convenience to users of LRAA and is not meant to be a substitute for SQANTI3. Users are encouraged to explore the more feature-rich SQANTI3 system as well.

Our SQANTI-like classifications are illustrated below.

The LRAA SQANTI-like categories are defined as follows:

Multi-exon reads

  • Full Splice Match (FSM) : the full chain of introns spliced from the read matches the full chain of introns spliced in the reference transcript.
  • Incomplete Splice Match (ISM) : the chain of introns spliced from the read matches a consecutive portion, but not all, of the chain of introns spliced in the reference transcript.
  • Novel In Category (NIC) : not FSM or ISM, but all splice sites in the read match reference splice sites.
  • Novel Not In Category (NNIC) : some, but not all, splice sites evident from the read alignment match reference splice sites (highlighted in the above image by '*').
  • genic : multi-segment read alignment matches reference exons, but none of its splice sites match the reference.
  • intronic : multi-segment read alignment lies entirely within a reference intronic region, with no exon overlap.
  • antisense : multi-segment read alignment overlaps a reference exonic region on the opposite strand.
  • intergenic : read alignment does not overlap the coordinate span of any reference gene.
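The splice-based categories above can be illustrated with a minimal Python sketch. This is not LRAA's implementation; it only restates the FSM, ISM, NIC, and NNIC definitions under simplifying assumptions: each intron is a `(start, end)` coordinate pair, all reads and reference transcripts are on the same chromosome and strand, and the function names are hypothetical. The remaining multi-exon categories depend on exon overlap and strand, so the sketch simply returns `"other"` for them.

```python
from typing import List, Tuple

Intron = Tuple[int, int]  # (intron start, intron end); both ends are splice sites


def is_contiguous_subchain(query: List[Intron], ref: List[Intron]) -> bool:
    """True if `query` appears as a consecutive run of introns inside `ref`."""
    n, m = len(query), len(ref)
    return any(ref[i:i + n] == query for i in range(m - n + 1))


def classify_multi_exon(read_introns: List[Intron],
                        ref_chains: List[List[Intron]]) -> str:
    """Assign FSM / ISM / NIC / NNIC to a read's intron chain, else 'other'.

    Assumption: same chromosome and strand for the read and all references.
    """
    if not read_introns:
        raise ValueError("multi-exon reads must have at least one intron")

    # FSM: the read's intron chain equals a reference transcript's full chain.
    if any(read_introns == ref for ref in ref_chains):
        return "FSM"

    # ISM: the read's chain is a consecutive part of a reference chain.
    if any(is_contiguous_subchain(read_introns, ref) for ref in ref_chains):
        return "ISM"

    # Collect every reference splice site, then count how many read sites are known.
    ref_sites = {site for ref in ref_chains for intron in ref for site in intron}
    read_sites = [site for intron in read_introns for site in intron]
    n_known = sum(site in ref_sites for site in read_sites)

    if n_known == len(read_sites):
        return "NIC"   # all splice sites known, but in a novel combination
    if n_known > 0:
        return "NNIC"  # some, but not all, splice sites known
    return "other"     # genic / intronic / antisense / intergenic need more context
```

For instance, with a reference chain of three introns, a read carrying only the last two introns would be called ISM, while a read joining the first intron's start directly to a later intron's end would be called NIC.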

Single-exon reads

  • se_FM : single-exon alignment matches a single-exon reference gene and overlaps at least 90% of its length.
  • se_IM : single-exon alignment matches a single-exon reference gene but fails to meet the se_FM criteria above.
  • se_genic : single-exon alignment matches a multi-exon reference gene and overlaps both introns and exons.
  • se_exonic : single-exon alignment matches a single exon region of a multi-exon gene. These usually result from partial sequences matching at 3' ends due to RNA degradation or another source of incompleteness, and there are enough of them that they deserve their own category.
  • se_intronic : single-exon alignment matches only within an intron of a multi-exon reference transcript.
  • se_antisense : single-exon alignment overlaps an exon of any gene on the opposite strand and does not fit any of the categories above.
  • se_intergenic : single-exon alignment does not overlap any reference gene span on either strand.
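The 90% rule separating se_FM from se_IM can be sketched as follows. This is an illustration only, not LRAA's code. It assumes 1-based inclusive coordinates, a read and gene on the same strand, and that "its length" refers to the length of the reference gene; the source text does not state which length is meant, so treat that reading as an assumption. The function names are hypothetical.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end), 1-based inclusive (assumed convention)


def overlap_len(a: Span, b: Span) -> int:
    """Number of bases shared by two inclusive intervals, 0 if disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def classify_vs_single_exon_gene(read: Span, gene: Span,
                                 min_frac: float = 0.90) -> Optional[str]:
    """Return 'se_FM' or 'se_IM' for a single-exon read vs a single-exon gene.

    Returns None when the read does not overlap the gene at all.
    Assumption: the 90% threshold is measured against the gene's length.
    """
    shared = overlap_len(read, gene)
    if shared == 0:
        return None
    gene_len = gene[1] - gene[0] + 1
    return "se_FM" if shared / gene_len >= min_frac else "se_IM"
```

Under these assumptions, a read spanning 1000-1950 against a gene spanning 1000-2000 covers 951 of 1001 bases and would be called se_FM, while a read spanning 1000-1500 would be called se_IM.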

Running comparisons

To capture category assignments for individual reads, run the following:

    LongReadAlignmentAssembler/util/SQANTI-like_cats_for_reads_or_isoforms.py \
           --ref_gtf ${reference_gtf_file} \
           --bam ${minimap2_aligned_reads_bam} \
           --output_prefix sqanti-like

Outputs

Outputs include:

  • ${output_prefix}.iso_cats.tsv.gz : every read name and its SQANTI-like category assignment.
  • ${output_prefix}.iso_cats.summary_counts.tsv.gz : total read counts for each category.
  • ${output_prefix}.iso_cats.bam : alignments with the annotations included as BAM tags: CL:Z:${SQANTI3-like_category} CI:Z:${isoform_ids_comma_delim}
  • ${output_prefix}.iso_cats.summary_counts.pdf : plots summarizing the counts according to multi-exon and single-exon read categories.
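As a rough sketch of how the CL and CI tags in the annotated BAM might be consumed, the Python below parses SAM-formatted text lines (for example, as produced by `samtools view`) using only the standard library. The function names are hypothetical, the exact strings LRAA writes into the CL tag are not confirmed here, and the read name and isoform IDs in the usage example are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def parse_category_tags(sam_line: str) -> Tuple[str, Optional[str], List[str]]:
    """Extract (read name, CL category, CI isoform IDs) from one SAM record line."""
    fields = sam_line.rstrip("\n").split("\t")
    read_name = fields[0]
    tags: Dict[str, str] = {}
    # SAM has 11 mandatory columns; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    category = tags.get("CL")
    isoform_ids = tags["CI"].split(",") if tags.get("CI") else []
    return read_name, category, isoform_ids


def count_categories(sam_lines: Iterable[str]) -> Counter:
    """Tally CL categories over SAM lines, skipping '@' header lines."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        _, category, _ = parse_category_tags(line)
        counts[category or "untagged"] += 1
    return counts


# Usage with a made-up record (the tag values here are illustrative only):
example = "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\t*\tCL:Z:FSM\tCI:Z:tx1,tx2"
name, category, isoforms = parse_category_tags(example)
```

In practice the lines could come from piping `samtools view ${output_prefix}.iso_cats.bam` into a script that passes `sys.stdin` to `count_categories`.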

An example ${output_prefix}.iso_cats.summary_counts.pdf is shown below: