Technical algorithm - glarue/intronIC GitHub Wiki

Technical Details

This page provides a detailed technical description of intronIC's algorithm, data flow, and machine learning architecture (current as of v2.7).

Pipeline Overview

Input: Genome (FASTA) + Annotation (GFF3/GTF) + Species Name

Note: By default, intronIC uses streaming mode, which processes one chromosome at a time. Streaming and in-memory modes produce bit-identical classifications since v2.4 β€” the choice is purely a memory tradeoff (roughly 50% less RSS on full human at -p 5 for streaming).

  1. Stage 1: Intron Extraction

    • Parse annotation hierarchy (gene β†’ transcript β†’ CDS/exon)
    • Infer intron coordinates from exon gaps
    • Extract intron + flanking exon sequences from genome
    • Filter duplicates, short introns, isoforms (configurable)
  2. Stage 2: PWM Scoring

    • Score 5' splice site, branch point, 3' splice site regions
    • Calculate log-odds ratios: $\log\left(\frac{P(\text{seq}|\text{U12})}{P(\text{seq}|\text{U2})}\right)$
    • Select best branch point position from search window
    • Apply species-specific U2-type background correction (see Species-Specific Background Correction)
  3. Stage 3: Adaptive normalizer fit

    • Score sampled introns through the corrected PWMs
    • Fit a per-species robust RobustScaler (median/IQR) for the first-pass features
    • Fall through to the bundled multispecies fallback scaler for inputs below MIN_ADAPTIVE_INTRONS (default 200) (see Normalization Modes)
  4. Stage 4: First-pass classification

    • 6D feature vector under classical z-scoring: 3 z-scores + support2 + bp_offset + bp_scan_confidence
    • 126-model cluster-aware RBF SVM ensemble (v4_aug)
    • Produces first_pass_svm (probability) and the candidate weights used to estimate per-species U12/U2 motif modes
  5. Stage 5: Mode estimation + gate

    • Compute soft candidate weights $w = \sigma((\text{first_pass_svm} - 90)/5)$ per intron
    • Weighted-median estimates of $\mu_{U12}$ (weights $w$) and $\mu_{U2}$ (weights $1-w$) for each motif feature
    • Three-check gate: $n_{\text{eff}} \geq 5$, $|\mu_{U12,5'} - 15.671| \leq 3.6$, multi-bandwidth Fisher-discriminant KDE valley depth $\geq 0.30$
  6. Stage 6: Second-pass classification (mode-separation)

    • Gate-pass: re-z-score motif features under mode-separation $z = (\text{raw} - \mu_{U2})/(\mu_{U12} - \mu_{U2})$, then score eligible introns ($z_{5'} \geq 0.30$) through the 126-model v5_modesep_aug ensemble. svm_score = second-pass probability; ensemble_sigma = std across sub-models. Ineligible introns retain their first-pass score.
    • Gate-fail: keep first-pass scores; apply the legacy Bayesian valley-depth + ensemble-agreement adjustment (see Score Adjustment) to suppress spurious calls in non-bimodal or U12-absent species.
  7. Stage 7: Continuous per-intron discount (v2.7+)

    • Apply a non-positive log-odds penalty: penalty_overcall = k_overcall Γ— max(0, svm_vs_naive βˆ’ Ο„_overcall) where svm_vs_naive = logit(p_in) βˆ’ raw_sum and raw_sum = 5'_raw + bp_raw + 3'_raw
    • Writes adjusted_score (the calling column); svm_score preserved for auditability
    • Optional weak-motif penalty (disabled by default) can be enabled via --discount-k-weakmot > 0
    • Calls: adjusted_score β‰₯ threshold (default 90)

Output: .meta.iic, .bed.iic, .introns.iic, .score_info.iic, .modesep.json, plots


Stage 1: Intron Extraction

Coordinate Inference

Introns are inferred from gaps between consecutive exons (or CDS features) within the same transcript:

Exon 1        Intron 1          Exon 2          Intron 2          Exon 3
[==]------------------------------[==]------------------------------[==]
1-100          101-1600         1601-1700        1701-3200        3201-3300

Priority: CDS features are preferred over exon features when available, as they enable phase calculation.

Mixed CDS/Exon Handling: For transcripts with both CDS and exon features, introns are first generated from CDS features, then exon-only introns (typically in UTR regions) are added if they don't overlap existing CDS-derived introns. All introns are sorted by genomic position and assigned sequential indices, ensuring proper ordering (e.g., 5' UTR introns are numbered before CDS introns in coding direction).

Touching Exons (Annotation Artifacts): Some annotations contain adjacent exon features with no gap between them (zero-length "introns"). These are silently skipped and not included in the intron count (family_size). The intron index sequence remains contiguous (1, 2, 3...) with no gaps for these annotation artifacts.

Filtering Criteria

Filter Default Description
Duplicates Exclude Same coordinates from multiple isoforms
Longest isoform Keep only Can include all with -i
Minimum length 30 bp Adjustable via --min-intron-len
Ambiguous bases Exclude 'N' in scoring regions
Non-canonical Include Exclude with --no-nc

Stage 2: PWM Scoring

Position Weight Matrices

Position weight matrices (PWMs) capture the probability of observing each nucleotide at each position in a motif. intronIC includes PWMs for:

  • U12-type: AT-AC and GT-AG subtypes
  • U2-type: GT-AG, GC-AG, and AT-AC subtypes

Each subtype has PWMs for all three regions (5' splice site, branch point, 3' splice site). The U2-type AT-AC PWMs prevent inflated log-ratio scores when scoring AT-AC introns, where the U2-type fallback to GT-AG matrices would produce artificially low U2-type scores at the dinucleotide positions.

Scoring Regions

Default scoring windows:

Region Relative to Start End Length Description
5' SS Intron 5' end -3 +9 12 bp Includes last 3 bp of upstream exon
Branch point Intron 3' end -55 -5 (clamped at 3'SS boundary) up to 50 bp Search window, clamped to exclude 3'SS region
3' SS Intron 3' end -6 +4 10 bp Core acceptor only (excludes PPT)

Non-overlapping regions: The branch point search window is clamped at the 3'SS scoring boundary to prevent feature overlap. For a 100 bp intron, the BP search region covers positions 45–93 and the 3'SS covers 94–100.

PWM Selection

For each intron, the terminal dinucleotides (e.g. GT-AG, AT-AC, GC-AG) determine which PWM subtype is used. Both the U12-type and U2-type PWMs are selected to match the intron's dinucleotide class:

  • A GT-AG intron is scored with U12-type GT-AG and U2-type GT-AG matrices
  • An AT-AC intron is scored with U12-type AT-AC and U2-type AT-AC matrices
  • A GC-AG intron is scored with U12-type GT-AG (fallback) and U2-type GC-AG matrices

When either model lacks a direct match for the dinucleotide class (e.g. GC-AG introns use U12-type GT-AG as a fallback, rare boundary types fall back to GT-AG for both models), the fallback PWM's terminal dinucleotide positions are masked so that the score reflects the surrounding motif context rather than a dinucleotide mismatch penalty.

Log-Odds Ratio Calculation

For each region, the raw score is a log-odds ratio computed by scoring the same sequence with both the U12-type and U2-type PWMs:

$$\text{LLR} = \log_2\left(\frac{\prod_{i} P(b_i | \text{U12 matrix})}{\prod_{i} P(b_i | \text{U2 matrix})}\right)$$

Where:

  • $b_i$ is the nucleotide at position $i$
  • The product is taken over all positions in the scoring window
  • Higher positive values favor U12-type
  • Higher negative values favor U2-type
  • Zero means equally likely under both models

For the 5'SS and 3'SS regions, the scoring window is fixed (see table above) and the log-ratio is computed directly.

Branch Point Selection

For the branch point region, position selection and scoring are separate steps:

  1. Position selection: The search window is scanned with a sliding window using the U12-type BPS PWM only. The position with the highest raw U12-type probability product is selected as the putative branch point.
  2. Log-ratio scoring: The same sequence at the selected position is then scored with both the U12-type and U2-type BPS PWMs, and the log-ratio is computed. This ensures the ratio compares both models at the same location.

The branch point adenosine position (bp_offset) is reported relative to the 3' splice site, derived from the PWM's defined reference position (biological position 0 in the matrix). U12-type branch points cluster tightly at -10 to -15 nt from the 3'SS (median -13), while U2-type branch points are more dispersed (-20 to -35 nt) (Pineda & Bradley 2018).

The scan also computes a BPS confidence score (bp_scan_confidence): log2(best_score / mean_score) across all scan positions. This measures how sharply the best match stands out from background β€” high values indicate a well-defined motif (U12-like), low values indicate a flat landscape (U2-like).


Species-Specific Background Correction

Before normalization, intronIC optionally corrects the U2-type denominator PWMs using the species' own nucleotide composition.

In intronIC ≀ v2.3 this step was load-bearing: the default model was trained on human reference data, so non-human species with different composition produced systematically shifted z-score distributions. Background correction reanchored the U2-type log-odds against the target species' own distribution, recovering calibration.

In v2.4 the default model is the v3 multispecies SVM, trained on 41,333 introns drawn from 90 species across 14 broad clades. The original justification for background correction β€” adapting a human-only model to non-human composition β€” no longer applies in the same way. Background correction now plays the role of an inference-time robustness layer for out-of-distribution species: for genomes deep inside the training distribution it is largely redundant with what the multispecies model and the adaptive RobustScaler already handle, but for extreme-composition genomes (e.g., very AT-rich or GC-rich) or clades not represented in training, it remains the cheapest way to keep U2-type log-odds well-anchored. Because the v3 corpus was itself scored with background correction on, disabling it at inference would create a train/inference distribution mismatch and is not recommended for the bundled model.

Method

The correction blends empirical per-position nucleotide frequencies from the target species' intron pool into the U2-type PWMs:

$$\text{PWM}_{\text{corrected}} = w \cdot \text{PWM}_{\text{empirical}} + (1 - w) \cdot \text{PWM}_{\text{prior}}$$

The blending weight is computed per dinucleotide subtype (GT-AG, GC-AG, AT-AC):

$$w = \frac{n_{\text{subtype}}}{n_{\text{subtype}} + n_0}$$

Where $n_{\text{subtype}}$ is the number of introns with that terminal dinucleotide and $n_0$ is a shrinkage parameter (default 1000). Rare subtypes (e.g., AT-AC with a few hundred introns) stay close to the prior PWM; common subtypes (e.g., GT-AG with 200k+ introns) shift toward the species' empirical distribution. Only the U2-type denominator PWMs are modified β€” the U12-type numerator PWMs remain fixed, preserving the signal that defines U12-type introns.

Configuration

Background correction is enabled by default. The key parameter is n0 (Bayesian shrinkage strength); use the same value at classification as was used to score the training corpus (default 1000) for the bundled model.

scoring:
  species_background:
    enabled: true
    n0: 1000          # Shrinkage strength
    min_introns: 500  # Skip correction below this count

Stage 3: Normalization (first-pass; classical z-scoring)

Why Normalize?

Raw log-odds scores have very different ranges and distributions per region. The numbers below come from scoring 257,123 deduplicated introns extracted from Homo sapiens GRCh38 + Ensembl 104 with the bundled v2.4 default settings (post-background-correction; [d]-tagged duplicates and unscored overlap introns excluded):

Region median 1st–99th %ile IQR
5'SS βˆ’41.0 βˆ’63.8 to βˆ’8.9 16.7
BPS βˆ’3.4 βˆ’20.6 to +11.7 9.0
3'SS βˆ’1.3 βˆ’5.6 to +3.2 2.9

Two things to notice. First, the absolute magnitudes are very different β€” the 5'SS IQR alone (β‰ˆ17) is larger than the 3'SS full 1st–99th %ile spread (β‰ˆ9). Without normalization the SVM kernel would be dominated by the 5'SS feature. Second, all three regions are negative-biased because raw scores are $\log\left(\frac{P(\text{seq}|\text{U12})}{P(\text{seq}|\text{U2})}\right)$ and the bulk of introns are U2-type β€” scoring well under U12 PWMs is the rare case. After RobustScaler normalization, all three regions sit on comparable scales (1st–99th %ile roughly βˆ’1.4 to +1.8), and the SVM sees commensurable inputs.

Note (v2.6+): The classical z-scoring described in this section is the first-pass normalization. The second pass uses mode-separation z-scoring that pins the U2 mode to z=0 and the U12 mode to z=1 in every species; see Stage 4: Two-pass classification below. Classical robust z-scoring is what feeds the first-pass classifier and is also what's used on the gate-fail path.

Robust Scaling

intronIC uses robust z-score normalization via sklearn's RobustScaler:

$$z = \frac{\text{raw} - \text{median}}{\text{IQR}}$$

Where IQR is the interquartile range ($Q_{75} - Q_{25}$).

Key properties:

  • Robust to outliers: Median and IQR are resistant to extreme values
  • Centered: Distribution centered around 0 for each feature
  • Comparable scales: All three regions have similar variance after scaling
  • Minimal U12-type contamination: Rare U12-type introns (~0.5% of all introns) do not significantly affect robust statistics

Note: A second normalization step (sklearn StandardScaler) is applied inside the classification pipeline to the full augmented feature vector (base z-scores + composite features + extra features). This ensures all features have zero mean and unit variance before SVM classification.

Normalization Modes

Mode Description Use case
adaptive Refit a fresh RobustScaler on the experimental input v3 default; recommended for any input β‰₯ 200 introns
human Apply a saved (frozen) scaler β€” the v2.3 human-only scaler in v2.3 bundles, or the bundled multispecies fallback in v3 bundles U12-absent / outlier genomes where you want to suppress per-species shift
auto Use the v2.3-style saved scaler if the bundle provides one, otherwise adaptive (with automatic fall-through to the bundled multispecies fallback when the input is too small to fit) Default behavior

Bundled multispecies fallback scaler. Since v2.4.2 the v3 bundle also ships a frozen RobustScaler (v3_fallback_normalizer.pkl) fit on 476,848 raw scores from the 90-species training corpus. It is loaded automatically alongside the model and is used by:

  1. auto / adaptive mode on small inputs: when fewer than MIN_ADAPTIVE_INTRONS (200) introns are scoreable, the per-input RobustScaler fit becomes too noisy (IQR standard error β‰ˆ 20% at n=30 vs β‰ˆ 7% at n=200), so the pipeline transparently falls through to the bundled fallback. This restores single-intron / tiny-annotation scoring (a regression in v2.4.0–v2.4.1).
  2. Explicit --normalizer-mode human: applies the bundled fallback unconditionally β€” useful for U12-absent genomes where you want to suppress any per-species z-score shift.
  3. Explicit --load-normalizer <path>: any saved scaler (the bundled one or one you saved with --save-normalizer) overrides whatever the bundle ships and skips the adaptive pre-pass. Honored by both streaming and in-memory paths since v2.4.2.

Notes:

  • The v2.3 bundle ships its own saved scaler and is unchanged β€” passing --model <v2.3-bundle> reproduces v2.3 behavior exactly.
  • In --streaming mode, the v3 default (auto/adaptive) fits a fresh RobustScaler via a lightweight per-contig pre-pass that scores all introns through the (BG-corrected) PWMs and pools their raw 5'/BP/3' scores before classification. Pre-pass is skipped when the bundled or --load-normalizer scaler is in effect.
  • For reproducible normalization across multiple runs on subsets of the same genome, pass --save-normalizer on the first run and --load-normalizer <path> on subsequent runs.

Stage 4: Two-pass classification

Why two passes?

Cross-species normalized motif separation varies (Ξ”_norm β‰ˆ 3.5 in vertebrates, β‰ˆ 2.8 in plants). A single SVM boundary trained in classical z-space is biased toward whichever clade dominates the training corpus; plants with weaker normalized motif strength get missed even when their PWM signatures are unambiguous U12s. Mode-separation z-scoring fixes the problem by pinning U2 to z=0 and U12 to z=1 in every species β€” but the transform needs $\mu_{U12}$, which requires knowing which introns are U12 candidates. The first pass exists to bootstrap that estimate.

First pass (cluster-aware ensemble)

The first-pass classifier (v4_aug_cluster_aware_2026-05, 126-model RBF SVM ensemble) scores every intron under classical z-scoring. Its outputs are used three ways:

  1. As first_pass_svm on every intron (preserved in score_info.iic).
  2. As candidate weights for mode estimation: $w = \sigma((\text{first_pass_svm} - 90)/5)$.
  3. As the calling score on the gate-fail path (where the second pass is suppressed).

Mode estimation

For each motif feature (5'ss, BP, 3'ss):

  • $\mu_{U12} = \text{weighted-median}(x, w)$
  • $\mu_{U2} = \text{weighted-median}(x, 1 - w)$

Soft weighting (rather than hard binarization at threshold 90) keeps the estimates well-behaved when first-pass calls are noisy.

Gate

Three checks determine whether the second pass runs:

Check Threshold Failure reason
$n_{\text{eff}}$ candidates $\geq 5$ below_n_floor
$\mu_{U12,5'}$ offset from cross-species anchor (15.671 raw PWM units) $\leq \pm 3.6$ u12_mode_outside_prior_range
Multi-bandwidth Fisher-discriminant KDE valley depth $\geq 0.30$ no_kde_valley

The valley-depth check uses the same Fisher's linear discriminant + multi-bandwidth KDE described under Species-Level Cluster Validation. The location-prior tolerance Β±3.6 raw PWM units was derived empirically from 61 U12-bearing species across 12 phyla ($\mu_{U12,5',raw} = 15.45 \pm 1.02$; $3 \sigma + 0.5\sigma$ buffer); it passes all 61 training-corpus species and catches adversarial first-pass failure cases that would otherwise produce hundreds of spurious U12 calls in U12-absent ciliates.

Second pass (mode-separation ensemble)

On gate-pass species, motif features are re-z-scored under mode-separation: $$z_{\text{modesep}} = \frac{\text{raw} - \mu_{U2,\text{species}}}{\mu_{U12,\text{species}} - \mu_{U2,\text{species}}}$$ The three other features (bp_offset, bp_scan_confidence, support2) pass through unchanged. Eligible introns ($z_{5'} \geq 0.30$) are then scored through the second-pass v5_modesep_aug_C200_g0.001_v2.6 ensemble (126 models, RBF SVM, same architecture as the first pass). Introns with $z_{5'} &lt; 0.30$ cannot reach the U12 threshold regardless of other features, so they retain their first-pass score.

svm_score in score_info.iic is the second-pass probability for modesep_route == "modesep" introns and the first-pass probability for modesep_route == "untouched" introns.

On gate-fail species, the second pass is suppressed and the legacy Bayesian valley-depth + ensemble-agreement adjustment (described under Score Adjustment) is applied to first-pass scores.

Diagnostic JSON output (.modesep.json)

Each run writes a per-species .modesep.json sidecar alongside score_info.iic. Top-level fields:

{
  "route": "modesep" | "first_pass_fallback",
  "gate_reason": "ok" | "below_n_floor" | "u12_mode_outside_prior_range" | "no_kde_valley" | "degenerate_separation",
  "quality_tier": "A" | "B" | "C" | "F",
  "n_introns": 257123,
  "n_eligible": 24433,
  "n_called_u12": 762,
  "n_eff_candidates": 713.3,
  "valley_depth": 0.757,
  "mu_u2_5p": -42.05,
  "mu_u12_5p": 14.16,
  "mu_u12_5p_offset": -1.51,
  "median_ensemble_sigma_called": 0.0,
  "p90_ensemble_sigma_called": 0.075,
  "first_pass_model_id": "v4_aug_cluster_aware_2026-05",
  "second_pass_model_id": "v5_modesep_aug_C200_g0.001_v2.6",
  "boundary_mass": 0.0021,
  "n_called_pre_discount": 762,
  "n_called_post_discount": 762,
  "continuous_discount_applied": true
}

route records whether the second pass ran (modesep) or whether the run fell back to first-pass scores (first_pass_fallback). gate_reason records why the gate refused on a fallback run.

Quality tiers

A four-tier rubric flags borderline runs so users don't need to interpret the numerical fields directly:

Tier Conditions Interpretation
A gate-pass + valley β‰₯ 0.5 + n_eff β‰₯ 20 + median Οƒ ≀ 0.10 Clean, high-confidence call set
B gate-pass + (valley β‰₯ 0.3 or n_eff β‰₯ 10) + median Οƒ ≀ 0.15 Acceptable; small sample or marginal valley
C gate-pass but valley < 0.3 or n_eff < 10 or median Οƒ > 0.15 Mode-sep applied but review recommended
F gate-fail First-pass scores emitted; no second-pass recalibration. Common for U12-absent species (Saccharomyces, Schizosaccharomyces, microsporidia, ciliates, Caenorhabditis)

boundary_mass (diagnostic only)

boundary_mass is the fraction of eligible introns whose second-pass ensemble mean $P$ falls in $[0.1, 0.9]$. It is a diagnostic for SVM uncertainty over the call set β€” it has no role in the gate or the v2.7 discount. Healthy in-distribution species sit at $\leq 1%$ (HomSap ~0.2%, DroMel ~0.1%); elevated values flag distributions where the SVM is uncertain over a non-trivial fraction of candidates.

Feature Space

The default model uses 6 features (6D):

Base features (3D):

  • s5z: 5' splice site z-score
  • BPz: Branch point z-score
  • s3z: 3' splice site z-score (core acceptor, -6 to +3)

Composite feature (1D):

  • support2: second-largest of max(0, s5z), max(0, BPz), max(0, s3z) β€” encodes "at least two scoring regions positively support U12-type classification." Zero when only one site is positive.

Extra features (2D):

  • bp_offset: branch point adenosine distance from 3'SS (negative integer). U12-type introns cluster at -10 to -15; U2-type at -20 to -35.
  • bp_scan_confidence: log2(best/mean) of U12-type PWM scores across all BP scan positions. Measures motif sharpness; U12-type introns produce sharper peaks than U2-type.

RBF SVM

The default model uses an RBF (radial basis function) kernel SVM (sklearn's SVC):

  • Kernel: RBF (nonlinear boundary, can learn local patterns)
  • Hyperparameters: C = 200, Ξ³ = 0.001 (v2.4 default; previously C = 35.11, Ξ³ = 0.01 in v2.3); selected via joint grid search on the multispecies training corpus
  • Class weights: Balanced to handle ~0.5% U12-type intron prevalence

The RBF kernel allows the classifier to learn curved boundaries β€” for example, requiring both strong donor AND strong BP for confident U12-type calls, rather than allowing one to compensate for the other as a linear model would.

Probability Calibration

The raw SVM outputs decision distances (signed distance from hyperplane). These are converted to probabilities using sklearn's CalibratedClassifierCV. During training, both sigmoid (Platt scaling) and isotonic calibration are evaluated via cross-validation, and the method with lower log-loss is selected.

  • Sigmoid (Platt scaling): Fits a logistic function with 2 parameters β€” more conservative probability estimates, fewer borderline calls
  • Isotonic: Non-parametric monotonically increasing step function β€” more flexible, typically selected when sufficient training data is available

The default pretrained model uses isotonic calibration.

Ensemble Training

When --n-models > 1, multiple SVMs are trained with different U2-type subsamples:

The v2.7 default bundle ships two 126-model RBF SVM ensembles, each built from 3 independent random seeds Γ— 42 sub-models per seed: the first-pass v4_aug_cluster_aware_2026-05 (classical z-scoring) and the second-pass v5_modesep_aug_C200_g0.001_v2.6 (mode-separation z-scoring). Each sub-model sees all U12-type references but a different random 75% subset of U2-type references. Final predictions are the mean probability across all sub-models in the active ensemble, and the standard deviation across the second-pass ensemble (ensemble_sigma) quantifies model agreement on modesep_route == "modesep" introns. (For historical context: the v2.3 default used a single-seed 42-model ensemble in a single-pass pipeline; v2.4 lifted that to a single 126-model ensemble; v2.6 introduced the two-pass mode-separation design.)


Training the Default Model

The v2.7 default bundle ships two embedded ensembles, both trained on multi-species reference data:

First-pass ensemble: v4_aug_cluster_aware_2026-05

Cluster-aware corpus rebuild on the multispecies v3 reference set with per-intron SVM scores feeding the cluster-membership augmentation. 126-model RBF SVM ensemble (3 seeds Γ— 42 sub-models), C = 200, Ξ³ = 0.001, easy_fraction = 0.75. Operates on classical z-scoring of the 6D feature set.

Second-pass ensemble: v5_modesep_aug_C200_g0.001_v2.6

Trained on multispecies_corpus_v5_modesep.tsv (502,921 training rows), with motif features re-z-scored under mode-separation against per-species $\mu_{U12}$ and $\mu_{U2}$ estimates derived from first-pass scores during corpus build. 126-model RBF SVM ensemble (3 seeds Γ— 42 sub-models), same hyperparameters as the first pass (C = 200, Ξ³ = 0.001, easy_fraction = 0.75; re-verified via 6-configuration HP sweep against the inherited values, with Ξ³ = 0.0001 marginally winning at 50K-row subsample but losing 4 TPs on the full corpus β€” inherited HPs retained). The PWMs are unchanged from v2.3.

Mode-separation phase 0 derivations

Several constants in the modesep bundle were derived empirically against held-out species panels:

  • z_floor_eligibility = 0.30 β€” $z_{5'}$ eligibility floor for the second pass. Introns below this cannot reach the U12 threshold; setting it higher loses recall, lower wastes second-pass compute without changing calls.
  • valley_depth_min = 0.30 β€” Gate threshold for multi-bandwidth Fisher-discriminant KDE valley depth.
  • mu_u12_5p_tolerance = 3.6 raw PWM units β€” Β±tolerance on $\mu_{U12,5'}$ from the cross-species anchor 15.671. Derived from 61 U12-bearing species across 12 phyla ($\mu_{U12,5',raw} = 15.45 \pm 1.02$; $3 \sigma + 0.5 \sigma$ buffer).
  • n_floor_candidates = 5 β€” Minimum $n_{\text{eff}}$ candidates for the gate.

Hyperparameter optimality verification (v2.7 release)

Before v2.7 release, the inherited SVM hyperparameters (C = 200, Ξ³ = 0.001, easy_fraction = 0.75 β€” carried over from v4_aug_cluster_aware) were empirically verified against 6 neighbor configurations. Subsample sweep (50K-row corpus, 14-species panel) showed Ξ³ = 0.0001 marginally edging out control by +4 TPs (1934 vs 1930); C/2, 2C, ezfΒ±0.15 statistically tied; Ξ³ = 0.01 lost significantly (βˆ’24 TPs). Full-corpus verification (502K rows, Ξ³ = 0.0001 only) showed Ξ³ = 0.0001 lost 4 TPs vs control on full corpus (1946 vs 1950) with a bundle-size penalty (119 MB vs 78 MB); the subsample-to-full-corpus signal failed. Conclusion: current HPs are at or near the global optimum within the tested neighborhood.

Training corpus

  • 41,333 introns drawn from 90 species across 14 broad evolutionary clades (vertebrates, arthropods, nematodes, monocots, eudicots, fungi, basal animals, protists, and others); a further 7 species are reserved as evaluation-only holdouts (this corpus is the basis of the first-pass ensemble; the second-pass corpus extends from it).
  • 10,003 U12-type positives + 31,330 U2-type and hard-negative records.
  • Per-intron labels are assigned by comparative genomic analysis: each intron's classification is supported by orthology evidence from multiple species, rather than a single reference annotation.
  • As of v2.4.1 the actual training set is bundled at u{12,2}_reference_multispecies.introns.iic.gz (full intron sequences with 50 bp flanks); the legacy human-anchored v2.3 references remain available as u{12,2}_reference_human.introns.iic.gz.
  • Branch point positions are derived from CoLa-seq empirical data (Zeng et al. 2022); all PWMs are unchanged from v2.3.

Calibration

Both ensembles use isotonic regression for probability calibration (selected over sigmoid by log-loss during training).

Evaluation

  • 14-species IPA-validated panel (v2.7): TP = 1,950, FN = 3, FP_strong = 0, FP_any = 216. Pareto improvement over v2.4 (single-pass v3 multispecies) on plant recall: AmbTri 90% β†’ 100% (51/51), OrySat 94% β†’ 100% (32/32), AraTha 96% β†’ 98% (47/48); Apostasia IPA-validated FNs 17/21 β†’ 20/21.
  • 10-fold leave-clade-out cross-validation on the first-pass corpus: every clade F1 β‰₯ 0.98.
  • U12-absent species (CaeEle, SacCer, SchPom, TetThe, ChlRei, AscSuu) gate-fail cleanly to first-pass scores; zero FP_strong leakage.

Prior bundles remain accessible by passing --model <path-to-bundle>:

  • v2.3 bundles: single-pass, classical z-scoring with the saved human-only frozen scaler.
  • v2.4 / v3 bundles: single-pass, multispecies cluster-aware classifier with the adaptive scaler (and bundled multispecies fallback for small inputs).
  • v2.6 bundles: two-pass mode-separation (the v2.7 bundle is the same mode-sep architecture; v2.7 adds the continuous discount as a post-processing layer).

Streaming mode (--streaming, the default) supports all bundle formats: v2.3 bundles use their saved frozen scaler; v2.4+ bundles trigger a lightweight per-contig pre-pass that fits a fresh adaptive RobustScaler before classification, falling through to a bundled multispecies fallback scaler when the input is too small for a stable per-input fit (see Normalization Modes).


Species-Level Cluster Validation

The same valley-depth detector is used in two places: (a) as one of the three checks in the mode-separation gate (Stage 5), where a valley depth $&lt; 0.30$ refuses the second pass; and (b) on the gate-fail path as part of the legacy Bayesian score adjustment (described in the next section). When mode-separation runs and the gate passes, the valley depth is reported as a diagnostic (in .modesep.json) but is no longer load-bearing for the calling decision.

Valley depth metric

The algorithm projects all intron scores onto a 1D discriminating axis in (5'z, BPz, 3'z) space, then estimates the density along this axis using kernel density estimation (KDE). The projection direction is Fisher's linear discriminant:

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_{\text{U12}} - \boldsymbol{\mu}_{\text{U2}})$$

where $\boldsymbol{\Sigma}$ is the diagonal-shrunk pooled within-class covariance. This is the optimal 1D projection under Gaussian assumptions: it down-weights features with high within-class variance β€” particularly 3'z, which carries substantial within-U12 variance from AT-AC vs GT-AG subtype differences. (Earlier 2D versions of the metric used a naive centroid-difference direction on (5'z, BPz) only and excluded 3'z; the Fisher reweighting absorbs that within-class noise instead of inheriting it into the projection, so adding 3'z becomes a net win β€” typically +2–10% valley depth on real-U12 species.)

A valley depth is then computed as the proportional drop in density between the U2-type and U12-type clusters:

$$\text{valley depth} = 1 - \frac{\rho_{\min}}{\min(\rho_{U2},, \rho_{U12})}$$

where $\rho_{U2}$ and $\rho_{U12}$ are the kernel density values at the U2-type and U12-type peaks along the centroid axis, and $\rho_{\min}$ is the minimum density value in the gap between them. Values range from 0 (no separation) to 1 (complete separation). To ensure robustness, the density is evaluated at multiple KDE bandwidths (0.5Γ— to 5Γ— Silverman's rule) and the median depth across bandwidths is reported. This prevents both over-smoothing (which hides real valleys) and under-smoothing (which creates spurious valleys from noise).

A valley depth > 0.3 is considered evidence of bimodal separation. The metric is reported in the log output along with centroid_Οƒ (the distance between U12-type and U2-type centroids in U2-type standard deviation units).

Interpretation

Valley depth Interpretation Example species (depth; n confident U12-type)
> 0.9 Strong bimodal separation β€” genuine U12-type population H. sapiens (0.94; n=741), D. melanogaster (0.94; n=18), I. scapularis (0.93; n=416)
0.5–0.9 Moderate separation β€” likely real but smaller U12-type population A. castellanii (0.59; n=13), V. jacobsoni (0.75; n=5)
< 0.3 No valley β€” U12-type calls do not form a distinct cluster C. elegans (0.00; n=0), A. suum (0.00; n=1)

When no valley is detected, intronIC emits a warning:

⚠ No density valley detected (depth=0.000): U12-type intron calls may not form a
distinct cluster. Consider reviewing calls with caution.

This does not prevent classification β€” individual intron scores are still valid β€” but signals that the species may lack a distinct U12-type intron population.


Score Adjustment

The path from svm_score to the calling column adjusted_score depends on whether the mode-separation gate passed:

  • Gate-pass species (Stage 6 ran the second pass): only the v2.7 continuous discount fires (see v2.7 continuous discount below).
  • Gate-fail species (U12-absent, non-bimodal, or with adversarial first-pass mode estimates): the legacy Bayesian valley-depth + ensemble-agreement adjustment is applied to first-pass scores first; the result then feeds the v2.7 discount.

Legacy Bayesian adjustment (gate-fail path)

The legacy adjustment operates in log-odds space and incorporates two independent signals β€” species-level population evidence and per-intron model agreement:

$$\text{logit}(p_{\text{adj}}) = \text{logit}(p_{\text{svm}}) + \log\left(\frac{\pi_{\text{species}}}{0.5}\right) - k_\sigma \cdot \sigma$$

Where:

  • $p_{\text{svm}}$: raw first-pass ensemble mean probability (isotonic-calibrated)
  • $\pi_{\text{species}}$: species-level prior from valley depth
  • $\sigma$: ensemble standard deviation (std of per-model probabilities)
  • $k_\sigma = 3.0$: disagreement penalty coefficient

Valley-to-prior mapping

The species prior is derived from valley depth via a sigmoid:

$$\pi_{\text{species}} = \pi_{\text{floor}} + \frac{0.5 - \pi_{\text{floor}}}{1 + \exp\left(-\frac{4.394}{w} \cdot (d - m)\right)}$$

Where $d$ is valley depth, $m$ is the midpoint (default 0.3), $w$ is the transition width (default 0.25), and $\pi_{\text{floor}}$ is the minimum prior (default 0.001).

Valley depth Prior ($\pi$) Effect on a 99% SVM score
0.0 0.001 Adjusted to ~9%
0.15 ~0.03 Adjusted to ~75%
0.30 ~0.25 Adjusted to ~98%
0.65+ ~0.50 No change

For species with strong U12-type intron populations (valley > 0.5), the adjustment is negligible. For species lacking U12-type introns (valley near 0), even high SVM scores are discounted below the threshold.

Effect on output

  • adjusted_score in score_info.iic: the post-adjustment probability (after the v2.7 continuous discount; see below)
  • rel_score in meta.iic and score_info.iic: adjusted_score - threshold
  • type_id (U12/U2): unchanged β€” still set by the raw SVM decision boundary
  • svm_score in score_info.iic: unchanged β€” still the second-pass ensemble mean (or first-pass on the gate-fail / modesep_route == "untouched" path)

Configuration

scoring:
  threshold: 90.0
  score_adjustment:
    enabled: true
    valley_midpoint: 0.3
    transition_width: 0.25
    prior_floor: 0.001
    k_sigma: 3.0

v2.7 continuous discount (per-intron overcall penalty)

After Stage 6 (gate-pass or gate-fail), a non-positive log-odds penalty is applied to every intron in score_info.iic. This is what produces the adjusted_score column on the gate-pass path, and what further trims adjusted_score on the gate-fail path.

$$\text{penalty}_{\text{overcall}} = k_{\text{overcall}} \cdot \max\bigl(0, \text{svm_vs_naive} - \tau_{\text{overcall}}\bigr)$$ $$\text{penalty}_{\text{weakmot}} = k_{\text{weakmot}} \cdot \max\bigl(0, \tau_{\text{motif}} - \text{raw_sum}\bigr)$$ $$\text{logit}(\text{adjusted_score}) = \text{logit}(p_{\text{in}}) - \text{penalty}_{\text{overcall}} - \text{penalty}_{\text{weakmot}}$$

Where svm_vs_naive = logit(p_svm) - raw_sum, raw_sum = 5'_raw + bp_raw + 3'_raw, and $p_{\text{in}}$ is svm_score on the gate-pass path or the legacy-adjusted score on the gate-fail path.

Empirically-tuned defaults (14-species IPA-validated panel + Salpingoeca):

Constant Default Notes
$k_{\text{overcall}}$ 2.0 Tunable via --discount-k-overcall
$\tau_{\text{overcall}}$ 0.0 Tunable via --discount-tau-overcall
$k_{\text{weakmot}}$ 0.0 (DISABLED) Opt-in via --discount-k-weakmot
$\tau_{\text{motif}}$ 10.0 Tunable via --discount-tau-motif

The overcall penalty alone suppresses long-tail Salpingoeca-class overcalls (29 β†’ 6 pre-legacy) while preserving all 1,950 IPA-validated panel TPs. The weak-motif penalty is opt-in because at default settings it loses 4 IPA-validated borderline TPs (SUDS3, ARPC5, ap4e1, mios) where the SVM correctly leverages non-motif features (bp_offset, bp_scan_confidence, support2) for borderline-motif U12s.

Disable the discount entirely with --no-continuous-discount (reproduces v2.6 behavior: adjusted_score == svm_score on gate-pass, legacy-adjusted on gate-fail).


Species-Specific Considerations

GC Content Effects

Species with different GC content than human may show shifted score distributions. Options:

  1. Adaptive normalization: Refit scaler on experimental data (--normalizer-mode adaptive)
  2. Prior adjustment: Adjust base rate expectation (--species-prior)

U12-Absent Lineages

The mode-separation gate automatically handles species lacking U12-type introns: when there is no detectable second-pass U12 mode (n_eff below floor, or no KDE valley in the Fisher-discriminant projection), the gate fails and the second pass is suppressed. The legacy Bayesian valley-depth + ensemble-agreement adjustment is then applied to first-pass scores, discounting them below the confidence threshold. Empirically, C. elegans, S. cerevisiae, S. pombe, T. thermophila, C. reinhardtii, and A. suum all gate-fail cleanly to first-pass scores with zero FP_strong leakage.

No special configuration is needed for species lacking U12-type introns. The v2.7 continuous discount applies on top of the gate-fail legacy adjustment, providing an additional defense in depth against long-tail loose-or-NA introns.

Cross-Species Performance

The default v2.6+ multispecies + mode-separation model generalizes broadly across eukaryotes. The per-species mode-separation step was specifically designed to fix the v2.4 weakness on lineages with smaller normalized U12/U2 motif separation: AmbTri 90% β†’ 100% (51/51), OrySat 94% β†’ 100% (32/32), AraTha 96% β†’ 98% (47/48), Apostasia IPA-validated FNs 17/21 β†’ 20/21. Performance may still degrade for:

  • Lineages with unusual U12-type motifs that fall outside the training distribution
  • Species with extreme GC bias and very small intron pools (where the adaptive RobustScaler becomes noisy and the bundled multispecies fallback applies; see Normalization Modes)
  • U12-bearing species with insufficient first-pass U12 candidates to estimate $\mu_{U12}$ confidently β€” these gate-fail to first-pass scores

For species suspected to fall outside the training distribution, inspect .modesep.json quality tier (A/B/C/F) and consider providing custom reference sequences via intronIC train if necessary.


Memory and Performance

--streaming (default) and --in-memory produce bit-identical classifications since v2.4 (covered by tests/integration/test_streaming_equivalence.py); the choice is purely a runtime/memory tradeoff.

Streaming Mode (--streaming, default)

Per-contig pipeline:

  • Writes intron sequences to a temporary SQLite database during extraction
  • Keeps only scoring motifs in memory
  • Each phase (extraction, BG correction, adaptive normalizer fit, classification) parallelizes across contigs via multiprocessing.Pool

In-memory Mode (--in-memory)

Loads all intron sequences into memory at extraction time. Used internally by --sequences and --bed input modes (those bypass the per-contig streaming path).

Reference benchmark

v2.7 sequential timing on commodity hardware (single workstation, NVMe SSD), default v2.7 bundle (mode-sep two-pass + continuous discount), -p 5 --streaming:

Species Genome / annotation Scored introns Wall time Peak RSS
Drosophila melanogaster Release 6 + ISO1_MT (~140 Mb) 47,000 ~8 min ~0.8 GB
Homo sapiens GRCh38.p13 + NCBI RefSeq GFF (~3.2 Gb) 257,123 ~40 min ~5.3 GB

--in-memory was not re-measured for v2.7. Based on the v2.4 ratio, in-memory is expected to finish at essentially the same wall time on multi-contig genomes with roughly 2Γ— the peak memory; streaming and in-memory remain bit-identical (covered by tests/integration/test_streaming_equivalence.py).

The HomSap wall-time growth from v2.4 (~16 min at -p 6) to v2.7 (~40 min at -p 5) is dominated by the two-pass mode-separation architecture: both ensembles are 126-model RBF SVMs and the second-pass v5_modesep_aug ensemble has substantially more support vectors than the v2.4 single-pass model (~90k total SVs across the ensemble at C = 200, Ξ³ = 0.001). The cluster-aware first pass scores every intron; the mode-sep second pass scores only the ~10% that meet the $z_{5'} \geq 0.30$ eligibility floor.

Parallelization

The -p N flag parallelizes PWM scoring and per-contig extraction:

  • Scoring is CPU-bound and parallelizes efficiently
  • Linear speedup up to ~8-16 cores
  • Diminishing returns beyond that
  • BG correction, adaptive-fit, and classify all dispatch through Pool workers in streaming mode

References

Original intronIC paper:

Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066–7078. doi:10.1093/nar/gkaa464

U12-type intron databases:

Larue GE, Roy SW. (2023) Where the minor things are: a pan-eukaryotic survey suggests neutral processes may explain much of minor intron evolution. Nucleic Acids Research 51(20):10884-10908. doi:10.1093/nar/gkad797

Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066-7078. doi:10.1093/nar/gkaa464

Alioto TS. (2007) U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Research 35:D110-D115. doi:10.1093/nar/gkl842

Branch point mapping and spliceosome profiling:

Burke JE, Longhurst AD, Merkurjev D, Sales-Lee J, Rao B, Moresco JJ, Yates JR III, Li JJ, Madhani HD. (2018) Spliceosome profiling visualizes operations of a dynamic RNP at nucleotide resolution. Cell 173(4):1014-1030.e17. doi:10.1016/j.cell.2018.03.020

Zeng Y, Fair BJ, Zeng H, Krishnamohan A, Hou Y, Hall JM, Ruthenburg AJ, Li YI, Staley JP. (2022) Profiling lariat intermediates reveals genetic determinants of early and late co-transcriptional splicing. Molecular Cell 82(24):4681-4699. doi:10.1016/j.molcel.2022.11.004

Mercer TR, et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Research 25:290-303. doi:10.1101/gr.182477.114

Pineda JMB, Bradley RK. (2018) Most human introns are recognized via multiple and tissue-specific branchpoints. Genes & Development 32(7-8):577-591. doi:10.1101/gad.312058.118

U12-type intron retention and functional studies:

NiemelΓ€ EH, et al. (2014) Global analysis of the nuclear processing of transcripts with unspliced U12-type introns by the exosome. Nucleic Acids Research 42(11):7358-7369. doi:10.1093/nar/gku391

Madan V, et al. (2015) Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome. Nature Communications 6:6042. doi:10.1038/ncomms7042

Cologne A, et al. (2019) New insights into minor splicingβ€”a transcriptomic analysis of cells derived from TALS patients. RNA 25(9):1130-1149. doi:10.1261/rna.071423.119

SVM probability calibration:

Platt JC. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers pp. 61-74.

⚠️ **GitHub.com Fallback** ⚠️