Output files - glarue/intronIC GitHub Wiki

Output files

intronIC will automatically generate a set of output files. A brief description of the contents of each file follows (numbered lists represent the columns within each file).

Note: The meta.iic and score_info.iic files include column headers by default. Use --no-headers to omit headers if needed for compatibility with legacy pipelines.

For many use-cases, the most important files will be:

  • introns.iic - full intron sequences, with U12-type probability score in 2nd column

  • bed.iic - BED file of intron coordinates, with U12-type probability score in 5th column

  • meta.iic - metadata file with intron info such as parent transcript/gene, ordinal index, phase, position as % of coding sequence, etc.

annotation.iic

If putatively misannotated introns are found, intronIC will correct their coordinates by adjusting the features that define them in this file. These entries will contain a 'shift' tag that indicates the change made to either their start or stop coordinate (or both). This file is only made if misannotated introns are found (otherwise, it would be identical to the original annotation).

demoted.iic

NOTE: this option is disabled by default; see command line arguments for details

Scoring information for putative U12-type introns whose scores fell below the threshold after their boundaries were switched to GT-AG (as a check to avoid scoring non-canonical introns as U12-type by way of superficial similarities to U12-type motifs):

  1. label
  2. initial score (with 5′ and BP scores in parentheses), followed by the reduced score after boundary switching

dupe_map.iic

A mapping table of unique, scored intron labels and their corresponding duplicate intron labels:

HomSap-gene-DIAPH2@rna-NM_006729.5_23(26)     HomSap-gene-DIAPH2@rna-NM_007309.4_23(26);[o:i];[i]
  1. scored intron label
  2. duplicate intron label
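
As a rough sketch of how this table might be used, the snippet below counts how many duplicates each scored intron has. It assumes the file is tab-separated with exactly the two columns listed above; the file name toy.dupe_map.iic and the labels in it are made up for illustration.

```shell
# Toy dupe_map.iic (hypothetical labels): scored label <TAB> duplicate label
printf '%s\t%s\n' \
  'geneA@tx1_1(3)' 'geneA@tx2_1(4)' \
  'geneA@tx1_1(3)' 'geneA@tx3_1(2)' \
  'geneB@tx1_2(5)' 'geneB@tx2_2(5)' > toy.dupe_map.iic

# Count duplicate labels per scored intron (column 1), sorted by label
dupe_counts=$(awk -F'\t' '{ n[$1]++ } END { for (k in n) print k "\t" n[k] }' toy.dupe_map.iic | sort)
printf '%s\n' "$dupe_counts"
```

On real output, the same awk command should work with species.dupe_map.iic in place of the toy file.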

introns.iic

All of the annotated intron sequences, including any introns not meeting scoring criteria (e.g. too short, including non-ATCG characters in scoring regions, etc.; includes duplicate introns if run with -d):

HomSap-gene-DIAPH2@rna-NM_006729.5_23(26)      100.0   TTTCAGCTCAAATTCTCAAGAGCAACCTTGCATCAATGGAACAACAAATTGTTCATCTGGAACGTGACATCAAGAAATTCCCCCAAGCAGAAAATCAACACGATAAGTTTGTGGAAAAGATGACC        ATATCCTTTATTTAT[...]TTAACAAAAAGCTAC       AGCTTTACAAAGACTGCCCGAGAACAGTATGAAAAACTCTCCACCATGCACAACAACATGATGAAGCTCTATGAGAATCTTGGAGAATACTTCATTTTTGACTCAAAGACAGTGAGCATAGAAGAGTTCTTTGGTGATCTCAACAACTTCCGAACTTTGTTTTTG
  1. label (without score tag)
  2. score (or '.' if unscored)
  3. upstream (5′ exon) sequence (100 nt by default, configurable via --flank-len)
  4. intron sequence
  5. downstream (3′ exon) sequence (100 nt by default, configurable via --flank-len)
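
The five columns above can be combined into a FASTA record that shows each intron in its exonic context. The following is a minimal sketch, assuming tab-separated columns in the order listed; the toy file, labels and sequences are invented, and lower-casing the exon flanks is just one possible display convention.

```shell
# Toy introns.iic (made-up data): label, score, 5' exon flank, intron, 3' exon flank
printf '%s\t%s\t%s\t%s\t%s\n' \
  'geneA@tx1_1(3)' '98.5' 'ACGTAC' 'ATATCCTTTAAC' 'GGCATG' \
  'geneA@tx1_2(3)' '.'    'TTGACC' 'GTAAGTTTTCAG' 'CCATGA' > toy.introns.iic

# Skip unscored rows ('.' in column 2); print exon flanks in lower case
# and the intron itself in upper case
flank_fasta=$(awk -F'\t' '$2 != "." {
  print ">" $1 " score=" $2
  print tolower($3) $4 tolower($5)
}' toy.introns.iic)
printf '%s\n' "$flank_fasta"
```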

bed.iic

A BED format file of intron coordinates with U12-type probability scores and labels:

NC_000023.11      97247840        97348115        HomSap-gene-DIAPH2@rna-NM_006729.5_23(26);9.999% 99.99999999997891    +
  1. genomic region (e.g. 'chr1')
  2. start coordinate (0-indexed)
  3. end coordinate (1-indexed)
  4. label (including rounded score tag)
  5. U12-type probability score (0-100)
  6. strand

Note: The BED file is not created when using sequence-only input (-q flag), as BED format requires real genomic coordinates.
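
Because column 4 carries a score tag while the labels in introns.iic do not, joining the BED file to other outputs may require stripping that tag first. The sketch below assumes the tag is the final ';'-delimited piece of the label (as in 'label;98.5%'); that format is inferred from the example above rather than documented, and the toy line is made up.

```shell
# Toy bed.iic line (made-up coordinates); column 4 ends in a ';<score>%' tag
printf 'chr1\t100\t250\tgeneA@tx1_1(3);98.5%%\t98.5\t+\n' > toy.bed.iic

# Remove everything from the last ';' onward in column 4 (assumed tag format),
# then print the bare label together with the full-precision score in column 5
bare_labels=$(awk 'BEGIN { FS = OFS = "\t" } { sub(/;[^;]*$/, "", $4); print $4, $5 }' toy.bed.iic)
printf '%s\n' "$bare_labels"
```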

log.iic

A log of all of the information generated during operation, including total number of introns processed, excluded, etc., and total number of U12-type introns identified.

pwms.iic

A FASTA file of the PWMs used, including those built from the experimental dataset (not including pseudocount values).

meta.iic

Additional metadata for each intron (ordered by increasing U12-type score):

HomSap-gene-DIAPH2@rna-NM_006729.5_23(26)        10.0    AT-AC   ACC|ATATCCTTTA...TGTTCCTTAACA/ATGTTCCTTAAC...GCTAC|AGC  TGATTGATTGCCTTTAAAAGGTACTGTTGAGCCA[TGTTCCTTAACA]AAAAGCTAC    100276  rna-NM_006729.5 gene-DIAPH2     23      26      86.025  0       u12     cds
  1. label
  2. relative score based upon score threshold (U2-type <= 0 < U12-type)
  3. terminal dinucleotides (e.g. GT-AG, AT-AC)
  4. motif string (5', U12/U2 BPS, 3')
  5. BPS location in context of 3' end
  6. bp_offset — branch point adenosine position relative to 3'SS (e.g. -13)
  7. length (bp)
  8. parent transcript
  9. parent gene
  10. ordinal position in transcript
  11. total introns in transcript
  12. fractional position in transcript as a percentage of the coding length, e.g. 50.0 for an intron that interrupts the coding sequence between codons 15 and 16 out of 30.
  13. phase (0, between codons; 1, after the first base of a codon; 2, after the second base of a codon)
  14. binary classification made by the classifier (u12 or u2), which may include introns of various probabilities within each class (i.e. introns labeled "u12" by the classifier may include introns with probabilities significantly lower than the specified threshold)
  15. genomic feature used to define the intron (e.g. cds, exon)
  16. attributes (comma-separated tags indicating special conditions, e.g. noncanonical, corrected, not_longest_isoform, duplicate)
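
As an illustration of working with these columns, the sketch below tallies terminal dinucleotides (column 3) among introns with a positive relative score (column 2). The toy file contains only the first three columns, and its header names and values are hypothetical; real header names may differ.

```shell
# Toy meta.iic: header row plus the first three columns only (made-up values)
printf '%s\t%s\t%s\n' \
  'name' 'rel_score' 'dnts' \
  'i1' '8.2' 'AT-AC' \
  'i2' '3.1' 'GT-AG' \
  'i3' '-40' 'GT-AG' \
  'i4' 'NA'  'GC-AG' \
  'i5' '9.9' 'GT-AG' > toy.meta.iic

# $2+0 forces a numeric comparison: the header text and "NA" both
# evaluate to 0, so neither passes the > 0 test
dnt_counts=$(awk -F'\t' '$2+0 > 0 { n[$3]++ } END { for (d in n) print d "\t" n[d] }' toy.meta.iic | sort)
printf '%s\n' "$dnt_counts"
```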

score_info.iic

Various scoring information (in order of increasing score). The file includes column headers by default. As of v2.7 the file has additional columns beyond the 32 documented below; column order may vary across runs (refer to the header row in the file itself when parsing). The 32 core tab-separated columns are:

  1. intron label
  2. relative score (maximum precision); U2-type <= 0 < U12-type
  3. SVM-assigned U12-type probability score (0-100) (averaged across ensemble models)
  4. 5′ sequence used for scoring
  5. 5′ log-ratio score
  6. 5′ z-score
  7. U12-type branch point sequence used for scoring
  8. U2-type branch point sequence used for scoring
  9. branch point log-ratio score
  10. branch point z-score
  11. 3' sequence used for scoring
  12. 3' log-ratio score
  13. 3' z-score
  14. min(5'z, BPz) composite feature
  15. min(5'z, 3'z) composite feature
  16. max(5'z, BPz) composite feature
  17. max(5'z, 3'z) composite feature
  18. distance from hyperplane (raw classifier output prior to probability calibration)
  19. bp_offset — branch point adenosine position relative to 3'SS (negative integer, e.g. -13)
  20. ppt_ct — C+T fraction at positions -14 to -7 (legacy PPT metric)
  21. ppt_raw — PWM log-ratio for PPT region
  22. core_3'_raw — PWM log-ratio for core 3'SS only
  23. fit_u12 — summed log2(P_U12) across all three regions
  24. fit_u2 — summed log2(P_U2) across all three regions
  25. fit_u12_5' — log2(P_U12) for 5'SS region
  26. fit_u12_bp — log2(P_U12) for BP region
  27. fit_u12_3' — log2(P_U12) for 3'SS region
  28. min_fit_bp_3 — min(fit_u12_bp, fit_u12_3')
  29. ppt_longest_run — longest uninterrupted C/T run in 20 nt window near 3'SS
  30. ppt_t_weighted — T-weighted pyrimidine score (T=1.0, C=0.5, purine=0) in same window
  31. adjusted_score — post-adjustment U12-type probability (0-100). In v2.7+ this is the calling column: it is the second-pass mode-separation svm_score after the v2.7 continuous per-intron discount on the gate-pass path, or the legacy Bayesian valley-depth + ensemble-agreement adjustment (chained through the v2.7 discount) on the gate-fail path. See Technical Details
  32. ensemble_sigma — standard deviation of per-model U12-type probabilities across the second-pass ensemble (126 sub-models in the v2.7 default; 0-100 scale). Low values indicate model consensus. Non-zero only for introns scored by the mode-separation second pass (modesep_route == "modesep").

Additional v2.6+ columns (present when mode-separation is in effect):

  • first_pass_svm — first-pass cluster-aware SVM probability (0-100). Lets users compare first-pass and second-pass calls without a separate file.
  • modesep_route: "modesep" for second-pass-scored introns; "untouched" for introns ineligible for the second pass (e.g., 5' z-score below the floor), which keep their first-pass score.

Additional v2.7+ columns (always present when mode-separation is in effect):

  • raw_sum — unweighted motif log-LR sum: 5'_raw + bp_raw + 3'_raw.
  • svm_vs_naive — calibration delta: logit(p_svm) − raw_sum. Drives the v2.7 overcall penalty.
  • voting_frac — fraction of second-pass sub-models voting U12 (per-model P > 0.5). Useful as an ensemble-agreement diagnostic in addition to ensemble_sigma.
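
Since column order may vary and extra columns can appear, it is safer to address score_info.iic columns by header name than by position. The following is a minimal sketch of that approach; it assumes the header names match the column names used in this page (adjusted_score, ensemble_sigma) and that the intron label is the first column. The toy file and its values are invented.

```shell
# Toy score_info.iic: header row plus three columns in arbitrary order (made-up values)
printf '%s\t%s\t%s\n' \
  'label' 'ensemble_sigma' 'adjusted_score' \
  'i1' '1.2' '97.0' \
  'i2' '0.0' '12.5' \
  'i3' '4.8' '91.3' > toy.score_info.iic

# Build a name -> index map from the header row, then use the indices by name
called=$(awk -F'\t' '
  NR == 1 {
    for (i = 1; i <= NF; i++) col[$i] = i
    adj = col["adjusted_score"]; sig = col["ensemble_sigma"]
    if (!adj) { print "adjusted_score column not found" > "/dev/stderr"; exit 1 }
    next
  }
  $adj + 0 > 90 { print $1 "\t" $adj "\t" $sig }
' toy.score_info.iic)
printf '%s\n' "$called"
```

This approach requires the header row, so it will not work on files written with --no-headers.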

Understanding the scores

SVM Score (0-100)

The primary classification score representing the probability that an intron is U12-type:

Score range    Interpretation
>90            High confidence U12-type (default threshold)
50-90          Intermediate confidence
<50            More likely U2-type
<10            High confidence U2-type

Relative Score

relative_score = adjusted_score - threshold

Positive values indicate the intron exceeds the high-confidence threshold; negative values fall below it. The magnitude measures distance from the threshold. Note that the binary type_id (U12/U2) is set by the raw SVM decision boundary and does not depend on the threshold.
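
A small worked example of the formula above, using two hypothetical adjusted scores and the default threshold of 90:

```shell
# Hypothetical adjusted scores (one per line) and the default threshold
threshold=90
rel_scores=$(printf '97.5\n42\n' | awk -v t="$threshold" '{ print $1 "\t" ($1 - t) }')
printf '%s\n' "$rel_scores"
# 97.5 gives 7.5 (above the threshold); 42 gives -48 (below it)
```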

Adjusted Score

As of v2.7 the adjusted score is produced by two layered steps:

  1. Mode-separation second pass (v2.6+) — On gate-pass species, the svm_score column is the probability emitted by the second-pass v5_modesep_aug ensemble after per-species recalibration. On gate-fail species (U12-absent, non-bimodal, or with adversarial first-pass mode estimates), the legacy Bayesian valley-depth + ensemble-agreement adjustment is applied to first-pass scores; this produces the input to step 2 on the gate-fail path.

  2. Continuous per-intron discount (v2.7+) — A non-positive log-odds penalty is applied to every intron: penalty_overcall = k_overcall × max(0, svm_vs_naive − τ_overcall), where svm_vs_naive = logit(p_in) − raw_sum and raw_sum = 5'_raw + bp_raw + 3'_raw. The penalty fires when the SVM overcalls relative to motif log-LR; it is zero in the healthy regime. The result is written to adjusted_score.

svm_score is preserved as the raw classifier output (for auditability); adjusted_score is the calling column.
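
To make the step 2 formula concrete, here is an illustrative sketch only, not intronIC's implementation. Several things are assumptions: the values of k and tau are placeholders (the real k_overcall and τ_overcall are not given on this page), the logit is taken with the natural log, and the penalty is assumed to be subtracted from the input logit before converting back to a 0-100 probability. The discount function name is hypothetical.

```shell
# Placeholder parameters (NOT the real k_overcall / tau_overcall values)
k=0.5; tau=2.0

# discount <p_in on 0-100 scale, strictly between 0 and 100> <raw_sum>
# Prints the discounted probability on a 0-100 scale.
discount() {
  awk -v p="$1" -v raw="$2" -v k="$k" -v tau="$tau" 'BEGIN {
    q = p / 100
    logit = log(q / (1 - q))                      # natural-log logit (assumed)
    delta = logit - raw                           # svm_vs_naive
    pen = (delta > tau) ? k * (delta - tau) : 0   # k * max(0, svm_vs_naive - tau)
    # Assumed application: subtract the penalty in log-odds space
    printf "%.2f\n", 100 / (1 + exp(-(logit - pen)))
  }'
}

healthy=$(discount 99 10)   # strong motif evidence: penalty is zero
overcall=$(discount 99 1)   # weak motif evidence: score is reduced
printf 'healthy=%s overcall=%s\n' "$healthy" "$overcall"
```

With these placeholder values, the first call leaves the score unchanged because svm_vs_naive falls below tau, while the second call is discounted.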

For species with strong U12-type intron populations (human, Drosophila), the discount typically fires only on long-tail loose-or-NA introns, and IPA-validated true positives are preserved. For ambiguous species, the gate-fail path adds the legacy valley-depth discount before the v2.7 penalty, suppressing false positives without altering the binary classification label.

See Technical Details for the formula and parameter documentation, and docs/mode_separation.md in the repo for the full mode-separation architecture.

Raw vs Z-Scores

  • Raw scores (log-odds ratios): log₂(P(seq|U12) / P(seq|U2)) — Different ranges for each region
  • Z-scores: Normalized for comparison — Unit variance, centered around reference distribution
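
A short numeric sketch of the two quantities, with made-up inputs. The z-score here is a plain (raw − mean) / SD standardization against a hypothetical reference mean and standard deviation; the normalization intronIC actually uses may differ (see Technical Details).

```shell
# Hypothetical motif probabilities under each PWM, and a made-up reference mean/SD
raw_and_z=$(awk -v p_u12=0.08 -v p_u2=0.01 -v ref_mean=1.0 -v ref_sd=2.0 'BEGIN {
  raw = log(p_u12 / p_u2) / log(2)   # log2 ratio; awk log() is the natural log
  z = (raw - ref_mean) / ref_sd      # simple standardization (assumed)
  printf "raw=%.3f z=%.3f\n", raw, z
}')
printf '%s\n' "$raw_and_z"
```

Here the sequence is 8 times more likely under the U12-type model, giving a raw score of log₂(8) = 3.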

See Technical Details for information on the normalization approach.


Common operations

Finding U12-type introns

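# From meta.iic (using relative score); $2+0 forces a numeric comparison,
# so the header row and "NA" scores are skipped
awk '$2+0 > 0' species.meta.iic

# From bed.iic (using SVM score; 90 is the default threshold)
awk '$5 > 90' species.bed.iic

# Count total U12-type introns (header row is not counted)
awk '$2+0 > 0' species.meta.iic | wc -l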

Extracting specific types

# AT-AC introns only
awk '($3 == "AT-AC")' species.meta.iic

# High-confidence U12-type AT-AC introns ($2+0 keeps "NA" scores from matching)
awk '($2+0 > 5 && $3 == "AT-AC")' species.meta.iic

# U12-type introns in first half of transcript
awk '($2+0 > 0 && $12 < 50)' species.meta.iic

Converting to FASTA

# All U12-type intron sequences
awk '$2 > 90 {print ">"$1"\n"$4}' species.introns.iic > u12.fasta