Overview - glarue/intronIC GitHub Wiki

Overview

intronIC has two primary uses, both of which require a genome and corresponding annotation file (or BED file of intron coordinates):

  1. Score all annotated introns against expectations for U12 introns, and classify introns as either U2- or U12-type using a support-vector machine (SVM)-based approach.
  2. Retrieve all annotated intron sequences and associated metadata (using the -s flag).

Classification basics

By default, introns with probability scores >90% are classified as U12-type (equivalent to a relative score >0). This threshold is likely to be conservative for most species, so it should produce relatively few false positives at the cost of some false negatives.
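The relationship between the probability threshold and the sign of the relative score can be illustrated with a minimal sketch. The `classify` helper and the example intron names are hypothetical, and the relative score is computed here as a plain difference from the threshold, which is an assumption made only to preserve the property stated above (probability above the threshold means relative score above zero); intronIC's own rescaling may differ.

```python
# Hypothetical helper for illustration; not part of intronIC's API.
THRESHOLD = 90.0  # default U12 probability cutoff, in percent


def classify(probability, threshold=THRESHOLD):
    """Return (label, relative_score) for a single intron.

    ASSUMPTION: the relative score is the simple distance from the
    threshold. Any score that is zero at the threshold and increases
    with probability would give the same classification.
    """
    relative = probability - threshold
    label = "u12" if relative > 0 else "u2"
    return label, relative


# Made-up probabilities for three imaginary introns
examples = {"intron_a": 99.5, "intron_b": 90.0, "intron_c": 12.0}
results = {name: classify(p) for name, p in examples.items()}

for name, (label, rel) in results.items():
    print(f"{name}\t{label}\t{rel:+.1f}")
```

Note that an intron scoring exactly at the threshold is left as U2-type in this sketch, matching the strict ">" in the text.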

One useful sanity check can be to examine the plot.scatter.iic.png and plot.hex.iic.png figures by eye to see whether there is clear separation between putative intron types.

For example, here is the score scatter plot for introns in human (each point is an intron, and the x and y axes are the z-scores for the 5'SS and BPS motifs, respectively), with introns classified as U12-type at a probability >90% in green:

Here are the same kinds of plots for species with significant (D. melanogaster) and complete (C. elegans) minor intron loss:

[Figure panels: D. melanogaster, C. elegans]

The C. elegans plot highlights a handful of spuriously U12-like introns (red), previously described in the literature, whose separation from the main cluster of introns is much less clear than that of the true positives in D. melanogaster.

Intron scoring applies to unique introns only

Importantly, by default intronIC only processes introns with unique coordinates from the longest annotated isoform for each gene (though this behavior is adjustable). Therefore, the same intron from multiple isoforms will only be included once, and named based upon the longest isoform. See Training data and PWMs and Data filtering notes for additional caveats.
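One way to read the default behavior described above is sketched below: pick the longest isoform for each gene, then keep each coordinate set only once. The `Intron` record, the `select_introns` function and the transcript lengths are all hypothetical and are not taken from intronIC's source; the sketch only illustrates the selection logic.

```python
from collections import namedtuple

# Hypothetical record type; intronIC's internal classes are not shown here.
Intron = namedtuple("Intron", "chrom strand start stop gene transcript")


def select_introns(introns, transcript_lengths):
    """Keep introns from each gene's longest transcript, once per coordinate set."""
    # Step 1: find the longest transcript for every gene
    longest = {}
    for intron in introns:
        current = longest.get(intron.gene)
        if (current is None
                or transcript_lengths[intron.transcript] > transcript_lengths[current]):
            longest[intron.gene] = intron.transcript

    # Step 2: keep introns from those transcripts, skipping repeated coordinates
    seen = set()
    kept = []
    for intron in introns:
        if intron.transcript != longest[intron.gene]:
            continue
        key = (intron.chrom, intron.strand, intron.start, intron.stop)
        if key not in seen:
            seen.add(key)
            kept.append(intron)
    return kept


# Made-up example: tx2 is a shorter isoform sharing one intron with tx1
lengths = {"tx1": 3000, "tx2": 1500}
annotated = [
    Intron("chr1", "+", 201, 300, "gene1", "tx2"),
    Intron("chr1", "+", 201, 300, "gene1", "tx1"),
    Intron("chr1", "+", 401, 500, "gene1", "tx1"),
]
selected = select_introns(annotated, lengths)
for intron in selected:
    print(intron.transcript, intron.start, intron.stop)
```

The shared intron at 201-300 appears once in the result and carries the name of the longer isoform, `tx1`.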

Brief method summary

intronIC should be able to process most annotations (even kinda crappy ones) provided they roughly adhere to GFF3/GTF formatting standards, and produce a set of output files described in detail on the Output files page. In order to work with intronIC, the annotation file must have parent information in the last column, such that all features (CDS or exon) from the same transcript/gene can be associated with one another. Beyond that requirement, the parser should be fairly flexible.

At a high level, intronIC works by aggregating all of the CDS/exon sequences under their parent transcripts and/or genes based on the parent-child relationships given by the last column in the annotation file. CDS features are used preferentially, as they allow intron phase to be determined, but exon features are also used in cases where they define unique introns (unless run with -f cds).
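The parent-based aggregation can be sketched roughly as follows. This is a simplified illustration rather than intronIC's actual parser: it assumes GFF3-style `key=value` attributes with a single `Parent` value per feature, ignores strand and GTF-style attributes, and the function names and example lines are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    pairs = (field.split("=", 1) for field in column.strip().split(";") if "=" in field)
    return {key.strip(): value.strip() for key, value in pairs}


def introns_by_parent(gff_lines, feature_type="CDS"):
    """Group features of one type by Parent and return the gaps between them."""
    features = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        parent = parse_attributes(cols[8]).get("Parent")
        if parent is not None:
            features[parent].append((int(cols[3]), int(cols[4])))

    introns = {}
    for parent, coords in features.items():
        coords.sort()
        # An intron runs from one feature's end + 1 to the next feature's start - 1
        introns[parent] = [
            (prev_stop + 1, next_start - 1)
            for (_, prev_stop), (next_start, _) in zip(coords, coords[1:])
            if next_start - prev_stop > 1
        ]
    return introns


# Made-up annotation lines for a single transcript
example_lines = [
    "\t".join(["chr1", ".", "CDS", "100", "200", ".", "+", "0", "ID=cds1;Parent=tx1"]),
    "\t".join(["chr1", ".", "CDS", "501", "600", ".", "+", "2", "ID=cds3;Parent=tx1"]),
    "\t".join(["chr1", ".", "CDS", "301", "400", ".", "+", "1", "ID=cds2;Parent=tx1"]),
]
found = introns_by_parent(example_lines)
print(found)
```

Features without a `Parent` attribute are silently dropped in this sketch, which mirrors why the parent information in the last column is required.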

Then, assuming the classification mode is being used, it scores the introns against configurable position-weight matrices (PWMs), does the same with training data for U2- and U12-type introns, feeds that information into an SVM optimization routine, and outputs a set of files containing intron sequences, metadata and classification information.
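As a rough illustration of what scoring a motif against a position-weight matrix involves, the sketch below computes a generic log-odds score against a uniform background and converts a set of scores to z-scores. The matrix values, the background frequency and the function names are all made up for the example; they are not intronIC's shipped matrices or its exact scoring formula, and the SVM step is not shown.

```python
import math

BACKGROUND = 0.25  # ASSUMPTION: uniform base frequencies

# Made-up 4-position frequency matrix (one dict per motif position)
PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]


def log_odds(seq, pwm=PWM, background=BACKGROUND):
    """Sum of per-position log2(frequency / background) for seq."""
    if len(seq) != len(pwm):
        raise ValueError("sequence and matrix lengths differ")
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq.upper()))


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd if sd else 0.0 for v in values]


# Made-up motif sequences
motifs = ["GTAT", "GTAA", "CCCC"]
raw = [log_odds(m) for m in motifs]
standardized = z_scores(raw)
for motif, score, z in zip(motifs, raw, standardized):
    print(f"{motif}\t{score:.2f}\t{z:+.2f}")
```

Per-motif z-scores of this kind correspond to the axes of the scatter plots described earlier, where each intron is placed by its 5'SS and BPS values.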

Basic usage

A typical default scoring run, which will use both CDS and exon features to define introns, and will include introns with non-canonical splice boundaries, might go like this:

intronIC -g {genome} -a {annotation} -n {binomial_name}

With the real file names from the sample data, this would be:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

See Quick start for further instructions on getting set up and checking your installation on test data.