Example usage
This page provides practical examples for common intronIC use cases. For full argument documentation, see the Usage info page.
Quick test
The easiest way to verify your installation:
# Run bundled test (Human Chr19, ~1 min with -p 4)
intronIC test -p 4
# Show test data location on your system
intronIC test --show-only
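If the test command is not found at all, the usual cause is that the environment intronIC was installed into is not active. A minimal sketch of a pre-flight check is shown here; check_tool is a hypothetical helper name, not part of intronIC.

```shell
# check_tool NAME: succeed if NAME resolves to something runnable on PATH.
# (Hypothetical helper for illustration; not provided by intronIC.)
check_tool() {
    command -v "$1" >/dev/null 2>&1
}

if check_tool intronIC; then
    # Installed: show where the bundled test data lives
    intronIC test --show-only
else
    echo "intronIC not found on PATH; is the right environment active?" >&2
fi
```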
Test data for manual runs
If you prefer to run classification manually with test data:
- Test data is bundled with the package; use intronIC test --show-only to find its location
- Alternatively, download the chromosome 19 test files:
Basic usage
Classification (recommended for most users)
The default pretrained model is loaded automatically:
intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz \
-a Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
-n homo_sapiens
This works for virtually all species. You can optionally specify a custom model:
intronIC -g genome.fa -a annotation.gff -n species --model custom.model.pkl
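To classify several genomes in one go, the same command can be wrapped in a loop. The sketch below assumes a hypothetical layout in which each genome DIR/NAME.fa.gz sits next to its annotation DIR/NAME.gff3.gz; classify_all and the DRY_RUN variable are illustrative names, not intronIC features. Only the -g, -a and -n flags shown above are used.

```shell
# classify_all DIR: run intronIC once for every DIR/<name>.fa.gz that has a
# matching DIR/<name>.gff3.gz. Set DRY_RUN=1 to print the commands instead
# of running them. (Hypothetical helper; the file layout is an assumption.)
classify_all() {
    for genome in "$1"/*.fa.gz; do
        [ -e "$genome" ] || continue      # no matches: the glob stays literal
        name=$(basename "$genome" .fa.gz)
        annot="$1/$name.gff3.gz"
        if [ ! -e "$annot" ]; then
            echo "skipping $name: no annotation found" >&2
            continue
        fi
        if [ -n "${DRY_RUN:-}" ]; then
            echo intronIC -g "$genome" -a "$annot" -n "$name"
        else
            intronIC -g "$genome" -a "$annot" -n "$name"
        fi
    done
}

# Usage (prints the commands without running them):
#   DRY_RUN=1 classify_all data
```

Each run gets its own name via -n, so output files from different species do not collide.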
Training a new model
To train a model on reference sequences:
intronIC train -n homo_sapiens
This creates a .model.pkl file that can be used for classification. Depending on the selected options, model training can take many hours; the default model should serve most users well.
Extracting intron sequences only
To extract introns without classification:
intronIC extract -g genome.fa -a annotation.gff -n species
Information about the run is printed to the screen; the same information (plus some additional details) is written to the log.iic file. The excerpt below, from a classification run on a full human genome, is illustrative: exact line counts, status lines, and model-loading messages may differ across versions, and v2.7 adds mode-separation and continuous-discount log lines not shown here.
================================================================================
intronIC v2.7.0
Started: 2025-12-08 12:44:39
================================================================================
Command and Configuration:
Command: /home/glarue/code/intronIC/.pixi/envs/default/bin/intronIC -g GCF_000001405.40_GRCh38.p14_genomic.fna.gz -a
GCF_000001405.40_GRCh38.p14_genomic.gff.gz -n homo_sapiens.cds -p 8 -f cds
Working directory: /home/glarue/code/intronIC/run_tests/hsapiens
Run name: homo_sapiens.cds
Input mode: annotation
Classification threshold: 90.0%
Output directory: /home/glarue/code/intronIC/run_tests/hsapiens
Genome: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.fna.gz
Annotation: /home/glarue/code/intronIC/run_tests/hsapiens/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Model: /home/glarue/code/intronIC/src/intronIC/data/default_pretrained.model.pkl
ℹ Streaming mode: processing per-contig
ℹ Loading pretrained model from /home/glarue/code/intronIC/src/intronIC/data/default_pretrained.model.pkl
Loaded two-pass mode-separation bundle (first-pass v4_aug_cluster_aware + second-pass v5_modesep_aug; 126 models per ensemble, 3 seeds × 42 sub-models)
Adaptive normalizer fit: scoring introns through PWMs to fit RobustScaler
ℹ Loading PWM matrices
ℹ Indexing annotation: GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Indexed 4,932,571 annotations across 705 contigs
ℹ Using indexed genome access: GCF_000001405.40_GRCh38.p14_genomic.fna.gz
ℹ Processing 705 contigs in parallel (8 processes)
Merging output: 202,594 (11.89%) scored + 45,650 (2.68%) omitted = 248,244 (14.56%) total introns for output files
ℹ Streaming classification complete: 202,594 introns classified
Total genes: 55,619, introns generated: 1,704,427
Intron Filtering Summary:
┌────────────────────────────┬────────────┬────────────┐
│ Category │ Included │ Excluded │
├────────────────────────────┼────────────┼────────────┤
│ Duplicates │ 0 │ 1,457,363 │
│ Too short │ 0 │ 240 │
│ Ambiguous bases │ 0 │ 4 │
│ Non-canonical │ 525 │ 0 │
│ Overlapping │ 0 │ 0 │
│ Alternative isoform │ 0 │ 45,211 │
├────────────────────────────┼────────────┼────────────┤
│ Total excluded │ │ 1,502,818 │
│ Retained for scoring │ │ 201,414 │
└────────────────────────────┴────────────┴────────────┘
Classification Results (threshold: 90.0%):
┌──────────────────────┬───────────┬────────────┐
│ Type │ Count │ Percentage │
├──────────────────────┼───────────┼────────────┤
│ U12-type (total) │ 702 │ 0.35% │
│ U12-type (AT-AC) │ 185 │ 0.09% │
│ U2-type │ 201,892 │ 99.65% │
├──────────────────────┼───────────┼────────────┤
│ Total │ 202,594 │ 100.00% │
└──────────────────────┴───────────┴────────────┘
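For scripted runs it can be handy to pull a single count out of the results table in the log. The sketch below assumes the row layout shown in the excerpt above, which may differ across versions; row_count is a hypothetical helper, and its label argument is interpreted as a basic regular expression, so it should match exactly one line of the log (for example U12-type (total) or U2-type, but not Total).

```shell
# row_count LABEL LOGFILE: print the first number that follows LABEL on its
# row, with thousands separators removed. (Hypothetical helper; assumes the
# table layout shown in the log excerpt.)
row_count() {
    sed -n "s/.*$1[^0-9]*\([0-9][0-9,]*\).*/\1/p" "$2" |
        head -n 1 |
        tr -d ','
}

# Usage, with the path to your run's log file:
#   row_count 'U12-type (total)' path/to/log.iic
```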
Sequence extraction only
If only the intron sequences are desired, use the extract subcommand which skips classification and produces only a subset of the output files:
intronIC extract -g genome.fa -a annotation.gff -n species
Using configuration files
For complex or reproducible runs, use a YAML configuration file:
# Generate a template configuration file
intronIC --generate-config > my_config.yaml
# Edit my_config.yaml, then run:
intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species
Example configuration:
scoring:
threshold: 90.0
exclude_noncanonical: false
score_adjustment:
enabled: true
extraction:
flank_length: 100
feature_type: both
training:
ensemble:
n_models: 42
eval_mode: nested_cv
performance:
processes: 8
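When comparing several thresholds, one config file per threshold keeps each run reproducible. The sketch below derives a variant from an existing config by rewriting its threshold line; set_threshold is a hypothetical helper, and it assumes the file contains a single key named threshold, as in the example above. Check the generated template before relying on that.

```shell
# set_threshold CONFIG VALUE: print CONFIG with the value of its
# "threshold:" key replaced by VALUE, preserving indentation.
# (Hypothetical helper; assumes exactly one "threshold:" line.)
set_threshold() {
    sed "s/^\([[:space:]]*threshold:\).*/\1 $2/" "$1"
}

# Usage:
#   set_threshold my_config.yaml 95.0 > my_config.t95.yaml
#   intronIC --config my_config.t95.yaml -g genome.fa -a annotation.gff -n species.t95
```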
Advanced: Custom normalization (rarely needed)
For most species, the default settings work well. The v3 model bundle ships with a multispecies fallback scaler that is used automatically for very small inputs (fewer than 200 scoreable introns), so single-intron / tiny-annotation runs work out of the box.
Two cases where you might want to override the default:
Reproducible normalization across runs on genome subsets
# First run: fit and save adaptive normalizer on full genome
intronIC -g genome.fa -a annotation.gff -n species \
--normalizer-mode adaptive --save-normalizer
# Subsequent runs: reuse the normalizer
intronIC -g subset.fa -a subset.gff -n species \
--load-normalizer species.normalizer.pkl
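When many subsets share one normalizer, it helps to fail early if the saved file is missing rather than partway through a batch. A minimal sketch follows; classify_subset and DRY_RUN are hypothetical names, and the PREFIX.fa / PREFIX.gff naming is an assumption modeled on the subset.fa / subset.gff example above.

```shell
# classify_subset NORMALIZER PREFIX: classify PREFIX.fa with PREFIX.gff,
# reusing a previously saved normalizer. Set DRY_RUN=1 to print the command
# instead of running it. (Hypothetical helper; file naming is an assumption.)
classify_subset() {
    if [ ! -f "$1" ]; then
        echo "normalizer not found: $1 (run the full genome with --save-normalizer first)" >&2
        return 1
    fi
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty
    ${DRY_RUN:+echo} intronIC -g "$2.fa" -a "$2.gff" -n "$(basename "$2")" \
        --load-normalizer "$1"
}

# Usage:
#   for prefix in subsets/chr1 subsets/chr2; do
#       classify_subset species.normalizer.pkl "$prefix" || break
#   done
```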
Force the bundled multispecies scaler
If you want to suppress per-species z-score shifts entirely (e.g., for U12-absent or outlier genomes, where the adaptive normalizer can compress the U2 distribution):
intronIC -g genome.fa -a annotation.gff -n species \
--normalizer-mode human
In a v3 bundle, --normalizer-mode human resolves to the bundled multispecies fallback scaler.
Note: Both are advanced features. For standard analyses on normal-sized genomes, default settings are correct.
Parallel processing
Speed up analysis with parallel processes (streaming mode is default and scales efficiently):
intronIC -g genome.fa -a annotation.gff -n species -p 8
The -p flag parallelizes the entire extraction and scoring pipeline. With streaming mode (default), -p values of 5 to 10 typically provide a 2-3× speedup with moderate memory usage.
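On shared machines the process count is often chosen from the number of available cores. The sketch below caps that number at 10, the top of the range mentioned above; the cap is an assumption for illustration, not an intronIC limit, and pick_procs is a hypothetical helper.

```shell
# pick_procs CORES [CAP]: print CORES, limited to CAP (default 10).
# (Hypothetical helper; the default cap is an illustrative choice.)
pick_procs() {
    cap=${2:-10}
    if [ "$1" -lt "$cap" ]; then
        echo "$1"
    else
        echo "$cap"
    fi
}

# Online core count; fall back to 1 if getconf is unavailable
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "suggested flag: -p $(pick_procs "$cores")"
```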
Memory modes
--streaming and --in-memory produce bit-identical classifications since v2.4 (covered by tests/integration/test_streaming_equivalence.py); the choice is purely a runtime/memory tradeoff. Reference run on full human GRCh38.p13 + NCBI RefSeq GFF, 257k scored introns, -p 5, default v2.7 bundle:
| Mode | Wall time | Peak RSS |
|---|---|---|
| --streaming (default) | ~40 min | ~5.3 GB |
| --in-memory | not re-measured for v2.7 (expected similar) | not re-measured for v2.7 (expected roughly 2×) |
Streaming mode (default)
# Streaming mode is automatic — no flag needed
intronIC -g GRCh38.fa.gz -a gencode.gff3.gz -n homo_sapiens -p 8
Streaming writes intron sequences to a temporary on-disk SQLite database during extraction, keeps only scoring motifs in memory, and parallelizes each phase (extraction, BG correction, adaptive-normalizer fit, first-pass classification, mode-separation second pass) per-contig.
In-memory mode
intronIC -g genome.fa -a annotation.gff -n species --in-memory
In-memory mode loads all intron sequences into memory at extraction time. It is also the path used internally by the --sequences and --bed input modes (those bypass the per-contig streaming pipeline). On small single-contig inputs, the per-contig overhead of streaming mode means in-memory is somewhat faster; on multi-contig genomes the two are roughly tied at typical parallelism levels.
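One way to apply this rule of thumb in a script is to count FASTA headers and pick the mode flag from the result. This is a sketch only: count_contigs and mode_flag are hypothetical helpers, and the check looks at the number of contigs but not at input size, so it is only meaningful for small inputs.

```shell
# count_contigs FASTA: number of sequences (header lines starting with ">")
# in a plain or gzip-compressed FASTA file. (Hypothetical helper.)
count_contigs() {
    case $1 in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep -c '^>'
}

# mode_flag FASTA: print --in-memory for single-contig inputs, otherwise
# --streaming. (Hypothetical helper; ignores input size.)
mode_flag() {
    if [ "$(count_contigs "$1")" -le 1 ]; then
        echo "--in-memory"
    else
        echo "--streaming"
    fi
}

# Usage:
#   intronIC -g genome.fa -a annotation.gff -n species "$(mode_flag genome.fa)"
```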
Many additional options exist for a variety of use cases. Run intronIC --help for additional details and/or see the Full usage info page.