Overview - glarue/intronIC GitHub Wiki

Overview

intronIC is a modular tool for intron extraction and U12-type/U2-type classification. It has two primary uses:

  1. Classification mode: Score all annotated introns against expectations for U12-type introns, and classify introns as either U2- or U12-type using a two-pass RBF SVM ensemble pipeline (v2.7 default: first-pass v4_aug + second-pass v5_modesep_aug, each a 126-model ensemble) on 6 sequence-derived features.
  2. Extraction mode: Retrieve all annotated intron sequences and associated metadata (using intronIC extract).

intronIC supports multiple input modes:

  • Genome + annotation (GFF3/GTF) for comprehensive analysis
  • Genome + BED file for coordinate-based extraction
  • Pre-extracted sequences (.iic format) for classification-only runs

Scientific Background

Minor vs Major Spliceosomes

Eukaryotic pre-mRNA splicing is catalyzed by two distinct spliceosomes:

Major (U2-dependent) spliceosome — Splices ~99.5% of introns

  • Recognizes GT-AG (most common) or GC-AG terminal dinucleotides
  • Uses U1, U2, U4, U5, U6 snRNPs
  • Branch point consensus: loose, A within ~18-40 nt of 3'SS

Minor (U12-dependent) spliceosome — Splices ~0.5% of introns

  • Recognizes both AT-AC (~25%) and GT-AG (~75%) terminal dinucleotides
  • Uses U11, U12, U4atac, U6atac, U5 snRNPs
  • Branch point consensus: highly conserved TCCTTAAC motif, typically 10-15 nt from 3'SS (median ~13 nt)
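The boundary rules in the two lists above can be illustrated with a small sketch. It is not part of intronIC; the function and constant names are hypothetical, and it only encodes the dinucleotide pairs named above. Note that GT-AG appears in both sets, which is why terminal dinucleotides alone cannot separate the two intron types.

```python
# Terminal dinucleotide pairs taken from the lists above.
# Names below are illustrative only, not intronIC identifiers.
U2_COMPATIBLE = {"GT-AG", "GC-AG"}
U12_COMPATIBLE = {"AT-AC", "GT-AG"}


def terminal_dinucleotides(intron_seq: str) -> str:
    """Return the boundary pair (e.g. 'GT-AG') for an intron sequence."""
    seq = intron_seq.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to have two terminal dinucleotides")
    return f"{seq[:2]}-{seq[-2:]}"


def compatible_spliceosomes(intron_seq: str) -> set[str]:
    """Return which spliceosome types the boundaries are compatible with.

    An empty set means the boundaries match neither list above.
    """
    pair = terminal_dinucleotides(intron_seq)
    types = set()
    if pair in U2_COMPATIBLE:
        types.add("U2")
    if pair in U12_COMPATIBLE:
        types.add("U12")
    return types
```

A GT-AG intron returns both types, so the remaining sequence features (5'SS, branch point, 3'SS) are needed to decide.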

Despite their rarity, U12-type introns are functionally important and evolutionarily conserved across most eukaryotic lineages. Their loss has been documented in several lineages including C. elegans, certain fungi, and some protists (Alioto 2007; Larue & Roy 2023).

Classification basics

intronIC reports two things for each intron: a binary classification (u12 or u2) based on the raw SVM decision boundary, and an adjusted probability score (0-100%). On the gate-pass path, the adjusted score is the second-pass mode-separation probability modified by the v2.7 continuous per-intron discount. On the gate-fail path (e.g., U12-absent species), the legacy Bayesian valley-depth + ensemble-agreement adjustment is applied first and then chained through the v2.7 discount. Introns with adjusted scores above the threshold (default 90%) are considered high-confidence U12-type calls. The rel_score column in the output equals adjusted_score - threshold, so positive values indicate high-confidence U12-type calls.
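The relationship between adjusted_score, the threshold, and rel_score can be written out as a minimal sketch. The class and method names are hypothetical; only the arithmetic (rel_score = adjusted_score - threshold, default threshold 90%) comes from the text above.

```python
from dataclasses import dataclass

DEFAULT_THRESHOLD = 90.0  # percent; the default stated in the text


@dataclass
class IntronCall:
    """Hypothetical container for one intron's adjusted score."""
    name: str
    adjusted_score: float  # adjusted U12 probability, 0-100

    def rel_score(self, threshold: float = DEFAULT_THRESHOLD) -> float:
        # rel_score = adjusted_score - threshold
        return self.adjusted_score - threshold

    def is_high_confidence_u12(self, threshold: float = DEFAULT_THRESHOLD) -> bool:
        # "Above the threshold" corresponds to a positive rel_score.
        return self.rel_score(threshold) > 0


calls = [IntronCall("intron_a", 97.5), IntronCall("intron_b", 42.0)]
confident = [c.name for c in calls if c.is_high_confidence_u12()]
```

Keep in mind that the binary u12/u2 label is derived separately from the raw SVM decision boundary, so it is not simply a thresholded version of this score.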

One useful sanity check is to examine the plot.scatter.iic.png and plot.hex.iic.png figures for clear separation between putative intron types.

For example, here is the score scatter plot for introns in human (each point is an intron, and the x and y axes are the z-scores for the 5'SS and BPS motifs, respectively), with introns classified as U12-type at probability >90% in green:

Here are the same kinds of plots for species with significant (D. melanogaster) and complete (C. elegans) minor intron loss:

[Figure panels: D. melanogaster | C. elegans]

The C. elegans plot highlights the presence of a handful of spuriously U12-like introns (red)—previously described in the literature—whose separation from the main cluster of introns is much less clear than in the case of the true positives in D. melanogaster.

Classification method

intronIC uses a seven-stage pipeline (v2.7):

  1. PWM Scoring — Score three key regions against position-weight matrices:

    • 5' splice site (donor): -3 to +9 relative to intron start
    • Branch point: Search window -55 to -5 from 3'SS
    • 3' splice site (acceptor): -6 to +4 relative to intron end
  2. Background Correction — Blend species-specific nucleotide frequencies into U2-type denominator PWMs to correct composition bias (see Technical Details)

  3. Adaptive normalizer fit — Score sampled introns through the corrected PWMs and fit a per-species robust z-scaler (median/IQR) for the first-pass features

  4. First-pass classification — Score every intron with the 126-model cluster-aware RBF SVM ensemble (v4_aug) on 6 features (3 z-scores + support2 + bp_offset + bp_scan_confidence); outputs first_pass_svm

  5. Mode estimation + gate — Estimate per-species μ_U12 and μ_U2 from soft candidate weights derived from first_pass_svm; gate against three checks (n_eff floor, μ_U12 location prior, Fisher-discriminant KDE valley depth)

  6. Second-pass classification (mode-separation) — On gate-pass, re-z-score motif features so that U2 → 0 and U12 → 1 in every species, then score eligible introns through the 126-model v5_modesep_aug ensemble; outputs svm_score. On gate-fail, keep first-pass scores and apply the legacy Bayesian valley-depth + ensemble-agreement adjustment (default for U12-absent species)

  7. Continuous per-intron discount — Apply a non-positive log-odds penalty for SVM overcalls relative to motif log-LR; outputs adjusted_score (the calling column) — see Technical Details
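Stages 1 and 2 can be pictured with a toy example. This is a sketch under stated assumptions, not intronIC's implementation: it assumes the motif score is a summed per-position log ratio of a U12-type PWM over a U2-type PWM, and that background correction is a linear blend with a mixing weight. The function names, the `weight` and `pseudo` parameters, and the two-column matrices are all hypothetical; the actual formulas are described on the Technical Details page.

```python
import math

BASES = "ACGT"


def blend_background(pwm, background, weight):
    """Mix background base frequencies into every PWM column.

    `weight` is a hypothetical mixing fraction in [0, 1]; 0 leaves the
    PWM unchanged, 1 replaces each column with the background.
    """
    return [
        {b: (1 - weight) * col[b] + weight * background[b] for b in BASES}
        for col in pwm
    ]


def log_ratio_score(seq, u12_pwm, u2_pwm, pseudo=1e-4):
    """Sum log2(P_U12 / P_U2) over positions (assumes only A/C/G/T)."""
    if not (len(seq) == len(u12_pwm) == len(u2_pwm)):
        raise ValueError("sequence and PWMs must have the same length")
    return sum(
        math.log2((u12_col[b] + pseudo) / (u2_col[b] + pseudo))
        for b, u12_col, u2_col in zip(seq.upper(), u12_pwm, u2_pwm)
    )


# Toy two-position matrices, for illustration only.
u12_pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
u2_pwm = [{"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
          {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

# Only the U2-type (denominator) matrix is corrected, as in stage 2.
corrected_u2 = blend_background(u2_pwm, background, weight=0.2)
at_score = log_ratio_score("AT", u12_pwm, corrected_u2)
gt_score = log_ratio_score("GT", u12_pwm, corrected_u2)
```

With these toy matrices the AT sequence scores positive (more U12-like) and GT scores negative (more U2-like).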

For more details on the algorithm, including normalization, feature augmentation, and the pretrained model architecture, see the Technical Details page.
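As a rough illustration of the per-species robust z-scaler in stage 3, the sketch below centers on the median and scales by the interquartile range. The function names are hypothetical, and the quartile method (Python's default "exclusive" method) is an assumption; intronIC's exact estimator may differ.

```python
import statistics


def fit_robust_scaler(sample_scores):
    """Fit (median, IQR) on a sample of raw scores."""
    q1, _, q3 = statistics.quantiles(sample_scores, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot scale")
    return statistics.median(sample_scores), iqr


def robust_z(score, median, iqr):
    """Scale one raw score using previously fitted parameters."""
    return (score - median) / iqr


# Fit on a sample, then apply to any intron's raw score.
median, iqr = fit_robust_scaler([1, 2, 3, 4, 5, 6, 7])
z = robust_z(8, median, iqr)
```

Using the median and IQR rather than the mean and standard deviation keeps the scaling stable when a small number of extreme scores (such as the rare U12-type introns) sit far from the bulk of the distribution.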

Intron scoring applies to unique introns only

Importantly, by default intronIC only processes introns with unique coordinates from the longest annotated isoform for each gene (though this behavior is adjustable). Therefore, the same intron from multiple isoforms will only be included once, and named based upon the longest isoform. See Training data and PWMs and Data filtering notes for additional caveats.
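One reading of this default can be sketched as follows. The record type, the `transcript_lengths` mapping, and the first-seen tie-breaking are all hypothetical; how intronIC measures isoform length and resolves ties is not specified here.

```python
from typing import NamedTuple


class Intron(NamedTuple):
    """Hypothetical intron record."""
    contig: str
    strand: str
    start: int
    stop: int
    gene: str
    transcript: str


def longest_isoform_introns(introns, transcript_lengths):
    """Keep introns from each gene's longest transcript, once per coordinate set."""
    # Pick the longest transcript per gene (first seen wins ties).
    longest = {}
    for intron in introns:
        best = longest.get(intron.gene)
        if best is None or transcript_lengths[intron.transcript] > transcript_lengths[best]:
            longest[intron.gene] = intron.transcript

    seen = set()
    unique = []
    for intron in introns:
        if longest[intron.gene] != intron.transcript:
            continue
        key = (intron.contig, intron.strand, intron.start, intron.stop)
        if key not in seen:
            seen.add(key)
            unique.append(intron)
    return unique
```

An intron shared by several isoforms is therefore reported once, under the name of the longest isoform.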

Brief method summary

intronIC should be able to process most annotations (even kinda crappy ones) provided they roughly adhere to GFF3/GTF formatting standards, and produce a set of output files described in detail on the Output files page. In order to work with intronIC, the annotation file must have parent information in the last column, such that all features (CDS or exon) from the same transcript/gene can be associated with one another. Beyond that requirement, the parser should be fairly flexible.

At a high level, intronIC works by aggregating all of the CDS/exon sequences under their parent transcripts and/or genes based on the parent-child relationships given by the last column in the annotation file. CDS features are used preferentially, as they allow intron phase to be determined, but exon features are also used in cases where they define unique introns (unless run with -f cds).
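The parent-child aggregation described above can be sketched in a few lines. This is a simplified illustration for GFF3-style `Parent=` attributes only, not intronIC's parser; the function names are hypothetical, and strand, phase, and GTF attribute syntax are ignored.

```python
from collections import defaultdict


def parse_parents(attributes: str) -> list[str]:
    """Return the Parent IDs from a GFF3 attribute column (may be several)."""
    for field in attributes.strip().split(";"):
        key, _, value = field.partition("=")
        if key.strip() == "Parent":
            return [p.strip() for p in value.split(",") if p.strip()]
    return []


def introns_by_parent(gff3_lines, feature_type="CDS"):
    """Group features by parent and return the gaps between adjacent features.

    Coordinates are 1-based and inclusive, as in GFF3.
    """
    children = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        for parent in parse_parents(cols[8]):
            children[parent].append((int(cols[3]), int(cols[4])))

    introns = {}
    for parent, spans in children.items():
        spans.sort()
        introns[parent] = [
            (left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(spans, spans[1:])
            if right_start - left_end > 1  # skip abutting or overlapping features
        ]
    return introns
```

Passing `feature_type="exon"` gives the exon-defined introns instead; the real tool combines both feature types as described above.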

Then, assuming the classification mode is being used, it will score the introns against configurable position-weight matrices and classify them using a pretrained SVM model. As of v2.6 the default bundle ships an embedded two-pass classifier: a first-pass cluster-aware ensemble (v4_aug) that produces candidate weights, and a second-pass mode-separation ensemble (v5_modesep_aug) that re-scores eligible introns after per-species recalibration. Both ensembles were trained on multi-species reference data spanning eukaryotic diversity, with intron-type labels assigned by orthology-based comparative genomic analysis across species. The output includes intron sequences, metadata, and classification information. For specialized use cases, users can also train custom models using the train subcommand.

Basic usage

A typical default scoring run uses both CDS and exon features to define introns, includes introns with non-canonical splice boundaries, and uses the built-in pretrained model:

intronIC -g {genome} -a {annotation} -n {binomial_name}

For the sample data:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

Advanced features

intronIC v2 includes several advanced capabilities:

  • YAML configuration files for managing complex parameter sets (see Example usage)
  • Pretrained models with per-species mode-separation recalibration for cross-species classification
  • Continuous per-intron discount (v2.7+) for long-tail false-positive suppression
  • Bayesian valley-depth score adjustment as the gate-fail fallback for U12-absent / non-bimodal species
  • Species-specific U2-type background correction for cross-species composition bias
  • Ensemble training with cross-validation for robust model building (--n-models)
  • Parallel processing for improved performance (-p)
  • Streaming mode (enabled by default) for roughly half the peak memory of --in-memory mode, with bit-identical classifications

Streaming mode (default)

Streaming mode is the default. It writes intron sequences to temporary on-disk storage during extraction and keeps only scoring motifs in memory; the full pipeline (extraction, background correction, first-pass classification, mode-separation second pass) parallelizes per-contig. Since v2.4, --streaming and --in-memory have produced bit-identical classifications, so the choice is purely a memory tradeoff. Reference run on full human GRCh38.p13 + NCBI RefSeq GFF with -p 5 (default v2.7 bundle, 257k scored introns):

Mode                    Wall time            Peak RSS
--streaming (default)   ~40 min              ~5.3 GB
--in-memory             similar (expected)   roughly 2× streaming (expected)

(--in-memory was not re-measured for v2.7.)

# Human genome with streaming (default)
intronIC -g GRCh38.fa.gz -a gencode.gff3.gz -n homo_sapiens -p 8

To disable streaming mode, pass --in-memory. In-memory is also the path used internally by -q (sequence input) and -b (BED input) modes.

See Quick start for further instructions on getting set up and checking your installation on test data.