Training data and PWMs - glarue/intronIC GitHub Wiki
Training data and PWMs
Default data files can be found in the data directory. They consist of the following:
v2.7 default model. As of v2.6+, the bundled default model (default_pretrained.model.pkl) is a two-pass classifier: a first-pass cluster-aware ensemble (v4_aug_cluster_aware_2026-05) and a second-pass mode-separation ensemble (v5_modesep_aug_C200_g0.001_v2.6). Each ensemble is 126 models (3 seeds × 42 sub-models). The first-pass corpus is the v3 multispecies set (41,333 introns from 90 training species across 14 evolutionary clades, with 7 additional holdout species reserved for evaluation; intron-type labels assigned by comparative genomic analysis across orthologs). The second-pass corpus (multispecies_corpus_v5_modesep.tsv, 502,921 rows) extends the v3 set with first-pass-derived per-species mode estimates used to re-z-score the motif features.
PWMs. The bundled intronIC_scoring_PWMs.json is unchanged from v2.3: it was built from a ~580-intron human + IPA-conserved gold standard for the 5'/3'SS PWMs and from CoLa-seq empirical branch-point positions (Zeng et al. 2022) for the 12 bp BPS PWMs. The v3 effort happened entirely at the SVM layer.
The decision not to rebuild PWMs from the multi-species corpus was empirical: a head-to-head comparison built fresh per-subtype PFMs from the multispecies gold standard and compared them to the bundled defaults via per-position Hellinger distance. Mean divergences were small (~0.04–0.08 for U12 GT-AG and AT-AC across 5'SS / BPS / 3'SS), the consensus base was unchanged at every diverging position (only frequencies shifted slightly), and cross-clade variation within the multispecies set sat at only 1.2–1.7× the within-clade noise floor, much of it driven by small-sample clades.
The AT-AA U12 subtype was also evaluated as a candidate for a subtype-specific PWM via a two-stage screen (per-position discrimination vs the u12_gtag fallback, plus a recall scan of how many AT-AA boundary introns sit just below the U12 threshold). AT-AA discriminates moderately at 5'SS (1.57× over noise) and 3'SS (1.43×) but not at BPS (0.96×, indistinguishable from GT-AG). Recall impact is small in absolute terms (~0.03% of all U12 calls genome-wide and a median of <2 introns per species), so an AT-AA-only PWM is deferred until at least one other non-canonical subtype passes the same screen, at which point the dispatch overhead is amortized and they can be bundled together.
Training corpus. As of v2.4.1, the v3 training corpus is bundled as u12_reference_multispecies.introns.iic.gz (10,003 U12 introns) and u2_reference_multispecies.introns.iic.gz (31,330 U2 introns): the actual 41,333-row training pool the v3 model was fit on (90 species, 14 clades, post-singleton-decay filter). These are the new defaults for intronIC train. The legacy human-anchored v2.3 reference sets remain bundled as u12_reference_human.introns.iic.gz and u2_reference_human.introns.iic.gz; pass --reference-u12s / --reference-u2s to point at them or any custom sets. Bundling sequences (rather than post-scoring feature TSVs) keeps the corpus robust to retraining choices: a user changing the BPS search window, flanking length, or PWM file can re-derive features from the same introns. Multispecies sequences ship with 50 bp flanks per side (trimmed from the per-species v2.3 runs); this is plenty for the default --five-score-coords -3 9 and --three-score-coords -6 4 windows, with headroom for any reasonable adjustment.
See Technical algorithm: Training the Default Model for full v2.6+ training details and evaluation numbers.
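The per-position Hellinger comparison mentioned above can be illustrated with a minimal Python sketch. This is not the project's evaluation code: the function names are hypothetical, and the two small matrices are made-up toy frequencies, not values from the bundled PWMs. Each matrix row is assumed to be a frequency distribution over A, C, G, T that sums to 1.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0 for identical distributions and 1 for non-overlapping ones.
    """
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)

def per_position_divergence(pfm_a, pfm_b):
    """Hellinger distance at each position of two equal-length PFMs."""
    if len(pfm_a) != len(pfm_b):
        raise ValueError("PFMs must cover the same number of positions")
    return [hellinger(row_a, row_b) for row_a, row_b in zip(pfm_a, pfm_b)]

def consensus(pfm, bases="ACGT"):
    """Most frequent base at each position."""
    return "".join(
        bases[max(range(len(row)), key=row.__getitem__)] for row in pfm
    )

# Toy two-position matrices (hypothetical values, columns ordered A, C, G, T).
bundled = [[0.05, 0.05, 0.05, 0.85], [0.70, 0.10, 0.10, 0.10]]
rebuilt = [[0.06, 0.05, 0.06, 0.83], [0.66, 0.12, 0.11, 0.11]]

distances = per_position_divergence(bundled, rebuilt)
mean_distance = sum(distances) / len(distances)
same_consensus = consensus(bundled) == consensus(rebuilt)
```

In this toy case the frequencies shift slightly while the consensus base stays the same at both positions, which is the pattern the comparison above reports for the real matrices.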
Reference intron sequences
Intron sequences (including any flanking exon sequence required for scoring) used to establish classification boundaries. For best results, these should contain representative introns of both types from the species being analyzed, although the default set should work well in many cases.
Two reference set pairs are bundled as of v2.4.1; intronIC train defaults to the multispecies set.
[u2, u12]_reference_multispecies.introns.iic.gz (v2.4.1 default)
The v3 multispecies training corpus the bundled default_pretrained.model.pkl was fit on:
- U12-type (10,003 introns) drawn from 90 species across 14 evolutionary clades, with type labels assigned by comparative genomic analysis across orthologs (Intron Position Analysis / IPA). Inclusion criteria: ≥3 species per ortholog group across ≥2 phyla; post-singleton-decay filter applied.
- U2-type (31,330 introns) from the same species panel, sampled to balance the U12 positives across clades and provide adequate coverage of the U2-type score-distribution tail.
- Flanks: 50 bp per side, oriented so the bases adjacent to each splice site are preserved.
[u2, u12]_reference_human.introns.iic.gz (legacy v2.3)
Retained for backwards compatibility and as the historical baseline:
- U12-type (472 introns): superset of the original 387-intron set (U12DB, SpliceRack, retention studies), plus 85 IPA-conserved additions.
- U2-type (30,155 introns): 20,690 vertebrate-conserved (human-mouse-zebrafish; human-macaque-marmoset) + 9,465 IPA-conserved.
- Branch point positions derived from CoLa-seq empirical data (Zeng et al. 2022).
- Flanks: 200 bp per side.
To train against the legacy human set, pass --reference-u12s u12_reference_human.introns.iic.gz --reference-u2s u2_reference_human.introns.iic.gz (or equivalent absolute paths).
Reference sequence formatting
The required tab-separated columns in the reference sequence files are:
- Name (or whitespace)
- Upstream flanking sequence (>= the number of upstream bases in the 5' scoring region)
- Intron sequence
- Downstream flanking sequence (>= the number of downstream bases in the 3' scoring region)
Some additional bookkeeping information is included in the default reference sequences; if supplying your own sequences, only the above columns are necessary.
HomSap-ENSG00000174444@ENST00000307961_3(9) 99.715 GGATGTTCTGATGCTGTTTGCTACCATGCATAATAGTCATGGGTTGGTATGTATGATGAAGCAGCTACAACTTTTATCTCTT
CTCAATTTTAGGTCATCAGACTAGTGCTGAGTCTTGGGGTACTGGCAGAGCTGTGGCTCGAATTCCCAGAGTTCGAGGTGGTGGGACTCACCGCTCTGGCCAGGGTGCTTTTGGAAAC GTATCCTTTGTTTCACTACC
TAAGACTGGTCATCTCTGATGGAATTTAGCGGCCTGGGTCTGGTTTATTGATGATAATACGGTGTAAAAATATTACTTTTTTTTGTCTTGAAGAGAAGGGGCTTCATTTATATGGGGTTATTTTGCTTGCAATGATGTCGTAATTT
GCGTCTTACTCTGTTCTCAGCGACAGTTGCCTGCTGTCAGTAAGCTGGTACAGAAGGTTGACGAAAATTCTTACTGAGCAAGAAATAACCTTGTTGTAATTACTAAAATTTGAGAAATGTGATTCTTGACTGGAAAAATAG
ATGTGTCGTGGAGGCCGAATGTTTGCACCAACCAAAACCTGGCGCCGTTGGCATCGTAGAGTGAACACAACCCAAAAACGATACGCCATCTGTTCTGCCCTGGCTGCCTCAGCCCTACCAGCACTGGTCATGTCTAAAGGTTTGTA
ATACTTTATATAAGAGAGTTTGATAGAAGAAATAAGACACCTACTATTTGATCA 10.1093/nar/gku391,10.1261/rna.071423.119,HomSap-CalJac-MicMur,HomSap-MusMus-DanRer,
U12DB,human-mouse-chicken-hagfish 6
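The flank-length requirement on the upstream and downstream columns can be checked before training on custom sequences. The sketch below is hypothetical helper code, not part of intronIC. It assumes that a negative start in the 5' scoring window counts exonic bases upstream of the intron, and that a positive end in the 3' scoring window counts exonic bases downstream of it; the exact inclusive/exclusive handling inside intronIC may differ.

```python
def required_flanks(five_coords=(-3, 9), three_coords=(-6, 4)):
    """Minimum (upstream, downstream) flank lengths for the scoring windows.

    Defaults mirror --five-score-coords -3 9 and --three-score-coords -6 4.
    Assumption: only the part of each window that extends past the intron
    boundary needs flanking sequence.
    """
    five_start, _ = five_coords
    _, three_end = three_coords
    upstream_needed = max(0, -five_start)
    downstream_needed = max(0, three_end)
    return upstream_needed, downstream_needed

def flanks_sufficient(upstream, downstream,
                      five_coords=(-3, 9), three_coords=(-6, 4)):
    """True if both flanking sequences are long enough for scoring."""
    up_needed, down_needed = required_flanks(five_coords, three_coords)
    return len(upstream) >= up_needed and len(downstream) >= down_needed

# 50 bp flanks, as shipped with the multispecies set, comfortably pass.
ok = flanks_sufficient("A" * 50, "C" * 50)
```

Under these assumptions the default windows need only 3 upstream and 4 downstream bases, so widening the windows moderately still fits within 50 bp flanks.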
Position-weight matrices
intronIC_scoring_PWMs.json (Default)
A unified JSON file containing all position-weight matrices (PWMs) for scoring intron splice site and branch point regions. This is the recommended format for v2.
Included matrices (v2.3.0):
- U12-type AT-AC and GT-AG: 5' splice site, 3' splice site, and branch point
- U2-type GT-AG, GC-AG, and AT-AC: 5' splice site, 3' splice site, and branch point
- Branch point PWMs: 12 bp motifs (positions -9 to +2 relative to branch A) derived from CoLa-seq empirical branch point positions (Zeng et al. 2022)
- Each BPS PWM includes a reference_offset field indicating the array index of the branch point adenosine (position 0)
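To make the role of reference_offset concrete, the following sketch maps matrix row indices to positions relative to the branch point adenosine. The helper name is hypothetical, and the offset value of 9 is an inference from the stated -9 to +2 span of a 12-row matrix, not a value read from the bundled file.

```python
def relative_positions(n_rows, reference_offset):
    """Position of each matrix row relative to the branch point A (position 0)."""
    return [index - reference_offset for index in range(n_rows)]

# A 12-row BPS matrix whose branch point A sits at array index 9 (assumed).
positions = relative_positions(12, 9)
branch_point_row = positions.index(0)
```

With these inputs the rows span -9 through +2 and the row at index 9 corresponds to the branch point adenosine.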
JSON structure:
{
  "format_version": "1.0",
  "matrix_groups": [
    {
      "description": ["U12-type intron scoring matrices"],
      "matrices": {
        "u12_atac_five": {
          "start_index": -20,
          "sample_size": 114,
          "bases": ["A", "C", "G", "T"],
          "matrix": [
            [0.262295, 0.245901, 0.327868, 0.163934],
            ...
          ]
        }
      }
    }
  ]
}
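A file with this structure can be read with the standard json module. The sketch below is illustrative only and is not intronIC's loader: the helper names are hypothetical, the embedded document is the example above truncated to its single visible row, and it assumes start_index gives the sequence position of the first matrix row.

```python
import json

# The example structure above, truncated to its one visible matrix row.
EXAMPLE = """
{
  "format_version": "1.0",
  "matrix_groups": [
    {
      "description": ["U12-type intron scoring matrices"],
      "matrices": {
        "u12_atac_five": {
          "start_index": -20,
          "sample_size": 114,
          "bases": ["A", "C", "G", "T"],
          "matrix": [[0.262295, 0.245901, 0.327868, 0.163934]]
        }
      }
    }
  ]
}
"""

def index_matrices(document):
    """Flatten all matrix groups into one {matrix_name: entry} mapping."""
    matrices = {}
    for group in document["matrix_groups"]:
        for name, entry in group["matrices"].items():
            matrices[name] = entry
    return matrices

def frequency(entry, position, base):
    """Frequency of `base` at `position`, counted from start_index (assumed)."""
    row = position - entry["start_index"]
    if not 0 <= row < len(entry["matrix"]):
        raise IndexError(f"position {position} is outside the matrix")
    column = entry["bases"].index(base)
    return entry["matrix"][row][column]

matrices = index_matrices(json.loads(EXAMPLE))
g_at_first_row = frequency(matrices["u12_atac_five"], -20, "G")
```

Looking up the column through the bases list, rather than hard-coding A, C, G, T order, keeps the lookup correct if a file declares its bases in a different order.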
Matrix naming convention: {type}_{boundary}_{region}[_v{version}]
- type: u12 or u2
- boundary: atac, gtag, gcag
- region: five, bp, three
- version (optional): for branch point variants (e.g., vA10)
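The naming convention can be expressed as a regular expression for validating or splitting matrix names. This is a sketch under one stated assumption: the optional suffix is written as an underscore followed by v and a label (so the vA10 example would appear as _vA10 and yield the version label A10). The pattern and function name are hypothetical and not taken from intronIC.

```python
import re

NAME_PATTERN = re.compile(
    r"^(?P<type>u12|u2)"
    r"_(?P<boundary>atac|gtag|gcag)"
    r"_(?P<region>five|bp|three)"
    r"(?:_v(?P<version>\w+))?$"   # optional suffix, assumed form: _v<label>
)

def parse_matrix_name(name):
    """Split a matrix name into its parts, or return None if it does not match."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None

plain = parse_matrix_name("u12_atac_five")
versioned = parse_matrix_name("u12_gtag_bp_vA10")
```

Names that do not follow the convention return None, which makes the helper usable as a quick check on custom PWM files.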
scoring_matrices.fasta.iic (Legacy)
The original FASTA-like format is still supported for backwards compatibility but is no longer the default.
>u12_atac_five start=-20 (n=61)
A C G T
0.262295082 0.2459016393 0.3278688525 0.1639344262
0.2295081967 0.262295082 0.1967213115 0.3114754098
0.2459016393 0.262295082 0.262295082 0.2295081967
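For anyone converting legacy files, the layout shown above (a header line with the name, start position, and sample size, then a base-order line, then one row of frequencies per position) can be parsed with a few lines of Python. This is a minimal sketch, not intronIC's own parser; the function name is hypothetical, and it assumes whitespace-separated values and that every header carries both start= and (n=...).

```python
import re

HEADER_PATTERN = re.compile(
    r"^>(?P<name>\S+)\s+start=(?P<start>-?\d+)\s+\(n=(?P<n>\d+)\)"
)

def parse_legacy_pwms(text):
    """Parse legacy FASTA-like PWM text into {name: entry} dictionaries."""
    matrices = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER_PATTERN.match(line)
            if match is None:
                raise ValueError(f"unrecognized header: {line!r}")
            current = {
                "start_index": int(match["start"]),
                "sample_size": int(match["n"]),
                "bases": None,
                "matrix": [],
            }
            matrices[match["name"]] = current
        elif current is None:
            raise ValueError("matrix data found before any header line")
        elif current["bases"] is None:
            # First line after the header lists the column order.
            current["bases"] = line.split()
        else:
            current["matrix"].append([float(value) for value in line.split()])
    return matrices

LEGACY_EXAMPLE = """\
>u12_atac_five start=-20 (n=61)
A C G T
0.262295082 0.2459016393 0.3278688525 0.1639344262
0.2295081967 0.262295082 0.1967213115 0.3114754098
0.2459016393 0.262295082 0.262295082 0.2295081967
"""

parsed = parse_legacy_pwms(LEGACY_EXAMPLE)
```

The resulting dictionaries use the same field names as the JSON example (start_index, sample_size, bases, matrix), which is a choice made here for convenience rather than a documented mapping.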
Customization
Any or all of the above files can be replaced, either permanently by modifying the bundled files or on a one-off basis using the --pwms and --reference-u12s/--reference-u2s command line arguments. The --pwms option supports .json, .yaml, and legacy .iic formats.