Overview
intronIC has two primary uses, both of which require a genome and corresponding annotation file (or BED file of intron coordinates):
- Score all annotated introns against expectations for U12 introns, and classify introns as either U2- or U12-type using a support-vector machine (SVM)-based approach.
- Retrieve all annotated intron sequences and associated metadata (using the -s flag).
Classification basics
By default, introns with probability scores >90% are classified as U12-type (equivalent to a relative score >0). This threshold is likely to be conservative for most species, so it should produce relatively few false positives at the cost of some false negatives.
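The thresholding rule above can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the relative score is modeled here as the probability minus the threshold, which is just one definition consistent with "above threshold means relative score >0". The formula intronIC actually uses may differ.

```python
def classify_intron(u12_probability: float, threshold: float = 90.0) -> dict:
    """Label an intron from its U12 probability (given in percent).

    ASSUMPTION: the relative score is modeled as probability minus
    threshold; intronIC's own definition may be scaled differently.
    """
    relative_score = u12_probability - threshold
    # Only scores strictly above the threshold are called U12-type.
    label = "u12" if relative_score > 0 else "u2"
    return {"type": label, "relative_score": relative_score}


if __name__ == "__main__":
    for probability in (99.2, 90.0, 12.5):
        print(probability, classify_intron(probability))
```

Lowering the threshold in a sketch like this trades false negatives for false positives, which is why checking the score plots by eye is worthwhile before changing it.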
One useful sanity check is to examine the plot.scatter.iic.png and plot.hex.iic.png figures by eye to see whether there is clear separation between putative intron types.
For example, here is the score scatter plot for introns in human (each point is an intron, and the x and y axes are the z-scores for the 5'SS and BPS motifs, respectively), with introns classified as U12-type at a probability >90% in green:
Here are the same kinds of plots for species with significant (D. melanogaster) and complete (C. elegans) minor intron loss:
D. melanogaster | C. elegans |
---|---|
The C. elegans plot highlights a handful of spuriously U12-like introns (red), previously described in the literature, whose separation from the main cluster of introns is much less clear than that of the true positives in D. melanogaster.
Intron scoring applies to unique introns only
Importantly, by default intronIC only processes introns with unique coordinates from the longest annotated isoform for each gene (though this behavior is adjustable). Therefore, the same intron from multiple isoforms will only be included once, and named based upon the longest isoform. See Training data and PWMs and Data filtering notes for additional caveats.
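As a rough illustration of the default behavior described above, the sketch below picks the longest isoform per gene and keeps each intron coordinate set only once. The data layout, the `Intron` tuple and the function name are all hypothetical, and "length" is assumed to be a precomputed number (for example, summed CDS or exon length); this is not intronIC's internal code.

```python
from typing import Dict, List, NamedTuple, Tuple


class Intron(NamedTuple):
    region: str   # chromosome / contig name
    strand: str
    start: int    # 1-based, inclusive
    stop: int     # 1-based, inclusive


# Hypothetical layout: {gene_id: {transcript_id: (length, [Intron, ...])}}
Genes = Dict[str, Dict[str, Tuple[int, List[Intron]]]]


def unique_introns(genes: Genes) -> Dict[Intron, str]:
    """Map each unique intron to the transcript it is named after."""
    kept: Dict[Intron, str] = {}
    for transcripts in genes.values():
        # Only the longest isoform of each gene contributes introns.
        longest = max(transcripts, key=lambda tid: transcripts[tid][0])
        for intron in transcripts[longest][1]:
            # Identical coordinates are recorded once (first one wins).
            kept.setdefault(intron, longest)
    return kept
```

With two isoforms sharing an intron, only the copy from the longer isoform survives, and it carries that isoform's name.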
Brief method summary
intronIC should be able to process most annotations (even kinda crappy ones) provided they roughly adhere to GFF3/GTF formatting standards, and produce a set of output files described in detail on the Output files page. In order to work with intronIC, the annotation file must have parent information in the last column, such that all features (CDS or exon) from the same transcript/gene can be associated with one another. Beyond that requirement, the parser should be fairly flexible.
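To make the parent-information requirement concrete, here is a minimal sketch of reading the last (ninth) column and grouping CDS/exon features by parent. It assumes GFF3-style `Parent=` attributes or GTF-style `transcript_id "..."` attributes; the function names are hypothetical and this is not intronIC's actual parser, which handles more cases.

```python
import re
from collections import defaultdict


def parent_of(attributes: str):
    """Return the parent ID from a GFF3 or GTF attribute string, or None."""
    gff3 = re.search(r"(?:^|;)\s*Parent=([^;]+)", attributes)
    if gff3:
        # GFF3 allows comma-separated parents; keep the first for simplicity.
        return gff3.group(1).split(",")[0]
    gtf = re.search(r'transcript_id\s+"([^"]+)"', attributes)
    return gtf.group(1) if gtf else None


def group_features(lines, feature_types=("CDS", "exon")):
    """Group (start, stop) pairs by (parent, feature type)."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        parent = parent_of(cols[8])
        if parent is not None:  # features without a parent cannot be linked
            grouped[(parent, cols[2])].append((int(cols[3]), int(cols[4])))
    return grouped
```

A feature line with no recoverable parent is simply dropped in this sketch, which mirrors why the requirement matters: without it, exons cannot be tied to a transcript.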
At a high level, intronIC works by aggregating all of the CDS/exon sequences under their parent transcripts and/or genes based on the parent-child relationships given by the last column in the annotation file. CDS features are used preferentially, as they allow intron phase to be determined, but exon features are also used in cases where they define unique introns (unless run with -f cds).
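Once features are grouped under a parent, introns are the gaps between consecutive features, and CDS features additionally give the phase. The sketch below illustrates both ideas with hypothetical function names. It assumes 1-based inclusive coordinates, and the phase calculation assumes the CDS segments are supplied in transcription order and that the first segment begins at the first base of a codon. It is not intronIC's own implementation.

```python
def introns_from_features(features):
    """Return (start, stop) of each gap between consecutive features."""
    ordered = sorted(features)
    introns = []
    for (_, prev_stop), (next_start, _) in zip(ordered, ordered[1:]):
        # Adjacent or overlapping features leave no intron between them.
        if next_start - prev_stop > 1:
            introns.append((prev_stop + 1, next_start - 1))
    return introns


def intron_phases(cds_segments):
    """Phase of the intron following each CDS segment except the last.

    Phase is the number of coding bases upstream of the intron, modulo 3.
    ASSUMPTION: segments are in transcription order and the first one
    starts in frame (phase 0).
    """
    phases = []
    coding_bases = 0
    for start, stop in cds_segments[:-1]:
        coding_bases += stop - start + 1
        phases.append(coding_bases % 3)
    return phases
```

Exon features alone give the same gaps but no reading frame, which is why phase can only be reported for introns defined by CDS features.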
Then, assuming the classification mode is being used, it will score the introns against configurable position-weight matrices, do the same with training data for U2- and U12-type introns, feed that information into an SVM optimization routine and output a set of files containing intron sequences, metadata and classification information.
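The scoring step can be illustrated with a toy example. The sketch below scores a motif as a log-ratio between a U12 and a U2 position-weight matrix and then converts raw scores to z-scores, which is what the axes of the scatter plots show. The log-ratio form, the pseudocount and all names here are assumptions made for illustration; the source does not spell out intronIC's exact scoring formula, and the SVM step is not covered.

```python
import math
from statistics import mean, pstdev


def pwm_log_ratio(seq, u12_pwm, u2_pwm, pseudocount=1e-4):
    """Sum of per-position log2(P_u12 / P_u2) for a motif sequence.

    Each PWM is a list of {base: frequency} dicts, one per position.
    ASSUMPTION: a small pseudocount avoids taking the log of zero.
    """
    score = 0.0
    for position, base in enumerate(seq):
        p_u12 = u12_pwm[position].get(base, 0.0) + pseudocount
        p_u2 = u2_pwm[position].get(base, 0.0) + pseudocount
        score += math.log2(p_u12 / p_u2)
    return score


def z_scores(values):
    """Standardize raw scores to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all scores identical
    return [(value - mu) / sigma for value in values]
```

In a sketch like this, a motif that matches the U12 matrix better than the U2 matrix gets a positive raw score, and the z-score expresses how unusual that score is relative to all scored introns.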
Basic usage
A typical default scoring run, which will use both CDS and exon features to define introns, and will include introns with non-canonical splice boundaries, might go like this:
intronIC -g {genome} -a {annotation} -n {binomial_name}
To concretize with real file names, for the sample data this would be:
intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
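For batch runs over several species, the same invocation can be assembled from Python. The helper below is a hypothetical convenience wrapper that uses only the -g, -a and -n flags shown above; actually launching it assumes intronIC is installed and on the PATH.

```python
import shlex


def intronic_command(genome: str, annotation: str, name: str) -> list:
    """Build the argument list for a default intronIC scoring run."""
    return ["intronIC", "-g", genome, "-a", annotation, "-n", name]


if __name__ == "__main__":
    cmd = intronic_command(
        "Homo_sapiens.Chr19.Ensembl_91.fa.gz",
        "Homo_sapiens.Chr19.Ensembl_91.gff3.gz",
        "homo_sapiens",
    )
    # Print the shell-equivalent command; to launch it, pass `cmd` to
    # subprocess.run(cmd, check=True) on a machine with intronIC installed.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when file paths contain spaces.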
See Quick start for further instructions on getting set up and checking your installation on test data.