Usage info - glarue/intronIC GitHub Wiki
This page contains the complete CLI reference for intronIC. For practical examples, see Example usage.
To get full usage information for intronIC, run intronIC --help:
usage: intronIC [-h] [--version] [--quiet] [--debug] [--config CONFIG_PATH]
[--generate-config] [-n SPECIES_NAME] [-o OUTPUT_DIR]
[-g GENOME] [-a ANNOTATION] [-b BED] [-q SEQUENCE_FILE]
[--model MODEL] [--pretrained-model PRETRAINED_MODEL]
[--normalizer-mode {human,adaptive,auto}]
[--species-prior PRIOR] [--load-normalizer PATH]
[--save-normalizer]
[-f {cds,exon,both}] [--min-intron-len MIN_INTRON_LEN] [-i]
[-v] [-d] [--flank-len FLANK_LEN] [--no-nc-ss-adjustment]
[-t THRESHOLD] [--no-nc] [--pseudocount PSEUDOCOUNT]
[--no-ignore-nc-dnts] [--five-score-coords START END]
[--bp-region-coords START END]
[--three-score-coords START END]
[-p PROCESSES] [--cv-processes CV_PROCESSES] [--streaming]
[--clean-names] [--no-clean-names] [-u] [--no-abbreviate]
[--abbreviate-filenames] [--no-headers] [--seed SEED]
{train,classify,extract,test} ...
intronIC: Intron classification and extraction tool
positional arguments:
{train,classify,extract,test}
Command to run (default: classify if not specified)
train Train a classifier on reference data only (no
genome/annotation needed)
classify Extract and classify introns from genome/annotation
extract Extract intron sequences without classification
test Run installation test with bundled test data
options:
-h, --help show this help message and exit
--version show program's version number and exit
--quiet Suppress non-essential output
--debug Enable debug logging
--config CONFIG_PATH Path to configuration file (auto-loads from standard
paths if not specified)
--generate-config Generate configuration file template and exit
-n SPECIES_NAME, --species-name SPECIES_NAME, --species_name SPECIES_NAME
Species name for output files (e.g., homo_sapiens)
-o OUTPUT_DIR, --output-dir OUTPUT_DIR, --output_dir OUTPUT_DIR
Output directory (default: current directory)
--normalizer-mode {human,adaptive,auto}, --normalizer_mode {human,adaptive,auto}
Normalizer mode for pretrained model classification
(default: auto):
human: Use scaler from training species (recommended
for U12-absent genomes)
adaptive: Refit scaler on experimental data
(experimental, may cause FPs in U12-free species)
auto: Use human if available in model, otherwise
adaptive
--species-prior PRIOR
Expected U12 prior for target species (0 to 1).
Adjusts classification probabilities via Bayes rule to
account for different U12 base rates. Recommended
values:
- 0.005: Human-like species (default if not specified)
- 1e-6: U12-absent species (C. elegans, many fungi)
- 1e-4: U12-poor species
Lower values reduce false positives in U12-free
lineages.
--load-normalizer PATH
Load a saved normalizer (from a previous run, or any
compatible scaler pickle) and use it instead of
fitting one. Honored in both streaming and in-memory
modes. When passed, this overrides whatever scaler
the model bundle ships with and skips the per-input
adaptive fit. Typical use: pass --save-normalizer on
a first full-genome run, then --load-normalizer
<path>.normalizer.pkl on subsequent runs over subsets
of the same genome to keep z-scores consistent.
--save-normalizer Save the fitted normalizer for future runs (adaptive
mode only). Use this on your first full-genome run for
a species to establish a reference normalization.
Future runs can use --load-normalizer to reuse this
normalization. Saved to <output_prefix>.normalizer.pkl
input selection:
Choose one mode: (1) -g + -a for annotation, (2) -g + -b for BED, or (3) -q for sequences
-g GENOME, --genome GENOME
Path to genome FASTA file (required with -a or -b)
-a ANNOTATION, --annotation ANNOTATION
Path to GFF3/GTF annotation file (requires -g)
-b BED, --bed BED Path to BED file with intron coordinates (requires -g)
-q SEQUENCE_FILE, --sequence-file SEQUENCE_FILE, --sequence_file SEQUENCE_FILE
Path to pre-extracted intron sequences (.iic format)
model options:
--model MODEL Path to pretrained model (.model.pkl)
--pretrained-model PRETRAINED_MODEL, --pretrained_model PRETRAINED_MODEL
(Deprecated: use --model) Path to pretrained model
extraction parameters:
-f {cds,exon,both}, --feature {cds,exon,both}
Feature type to extract from (default: both)
--min-intron-len MIN_INTRON_LEN, --min_intron_len MIN_INTRON_LEN
Minimum intron length (default: 30)
-i, --allow-multiple-isoforms, --allow_multiple_isoforms
Include non-longest isoforms
-v, --no-intron-overlap, --no_intron_overlap
Exclude overlapping introns
-d, --include-duplicates, --include_duplicates
Include duplicate coordinate introns
--flank-len FLANK_LEN, --flank_len FLANK_LEN
Exonic flank length (default: 100)
--no-nc-ss-adjustment, --no_nc_ss_adjustment
Disable U12 boundary correction
scoring parameters:
-t THRESHOLD, --threshold THRESHOLD
U12 probability threshold 0-100 (default: 90)
--no-nc, --no_nc Exclude non-canonical introns from scoring
--pseudocount PSEUDOCOUNT
PWM pseudocount (default: 0.0001)
--no-ignore-nc-dnts, --no_ignore_nc_dnts
Include terminal dinucleotides in non-canonical
scoring
--five-score-coords START END, --five_score_coords START END
5' splice site region (default: -3 9)
--bp-region-coords START END, --bp_region_coords START END
Branch point region (default: -55 -5)
--three-score-coords START END, --three_score_coords START END
3' splice site region (default: -6 4)
performance options:
-p PROCESSES, --processes PROCESSES
Parallel processes for scoring (default: 1)
--cv-processes CV_PROCESSES, --cv_processes CV_PROCESSES
Processes for cross-validation (training only, default: same as -p)
--in-memory, --no-streaming
Use in-memory mode: load full genome into memory. Higher memory
usage but may be slightly faster for very small genomes.
--streaming Use streaming mode (default): stores sequences in temp on-disk
storage, keeps only scoring motifs in memory. Roughly halves
peak RSS vs --in-memory (e.g., ~5 GB vs ~10 GB for full human at
-p 6). Bit-identical to in-memory.
output options:
--clean-names, --clean_names
Remove "transcript:" and "gene:" prefixes (default:
True)
--no-clean-names, --no_clean_names
Keep ID prefixes
-u, --uninformative-naming, --uninformative_naming
Use simple naming scheme
--no-abbreviate, --no_abbreviate, --na
Use full species name in outputs
--abbreviate-filenames, --abbreviate_filenames, --afn
Abbreviate species name in filenames
--no-headers, --no_headers
Omit column headers from output files (default:
include headers)
advanced options:
--seed SEED Random seed (default: 42)
Examples:
# Test installation with bundled data
intronIC test -p 4
# Train a model on reference data (no genome needed!)
intronIC train -n homo_sapiens
# Classify introns with pretrained model (uses streaming mode by default)
intronIC classify -g genome.fa -a annotation.gff -n species --model species.model.pkl
# Extract intron sequences without classification
intronIC extract -g genome.fa -a annotation.gff -n species
# Parallel processing for faster analysis (streaming mode scales efficiently)
intronIC classify -g genome.fa -a annotation.gff -n species -p 8
# Use in-memory mode for very small genomes
intronIC classify -g genome.fa -a annotation.gff -n species --no-streaming
# Backward compatible (no subcommand = classify)
intronIC -g genome.fa -a annotation.gff -n species --model species.model.pkl
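
The --species-prior description says classification probabilities are adjusted via Bayes rule to account for different U12 base rates. As a rough illustration of what such an adjustment can look like, the Python sketch below applies the standard prior-shift correction to a single probability and then compares the result to a 0-100 threshold like the one -t takes. This is not intronIC's code: the function names, the assumed training prior of 0.005, and the >= comparison are all assumptions made for the example.

```python
"""Illustrative prior-shift correction; not taken from intronIC's source."""


def adjust_for_prior(p, train_prior, target_prior):
    """Re-weight probability p, estimated under train_prior, for target_prior.

    Hypothetical helper: applies Bayes' rule with the class likelihoods held
    fixed and only the class base rate changed.
    """
    for name, value in (("train_prior", train_prior), ("target_prior", target_prior)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # Scale each class by the ratio of its new prior to its old prior,
    # then renormalize so the two classes sum to 1 again.
    u12 = p * (target_prior / train_prior)
    u2 = (1.0 - p) * ((1.0 - target_prior) / (1.0 - train_prior))
    return u12 / (u12 + u2)


def passes_threshold(p, threshold=90.0):
    """Compare a 0-1 probability to a 0-100 threshold (>= is an assumption)."""
    return 100.0 * p >= threshold


if __name__ == "__main__":
    ASSUMED_TRAIN_PRIOR = 0.005  # assumption: a human-like base rate
    raw = 0.95  # made-up classifier output for one intron
    for target in (0.005, 1e-4, 1e-6):  # recommended values from the help text
        adjusted = adjust_for_prior(raw, ASSUMED_TRAIN_PRIOR, target)
        call = passes_threshold(adjusted)
        print(f"prior={target:g} adjusted={adjusted:.6f} U12 call={call}")
```

Under these assumptions, a raw probability of 0.95 is unchanged when the target prior equals the training prior, and falls well below the default threshold of 90 when the target prior is 1e-6. That is consistent with the help text's note that lower values reduce false positives in U12-free lineages.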