Quick start/testing
Installation
Using pip (recommended)
Install the latest stable version from PyPI:
python -m pip install intronIC
Or install the latest version directly from GitHub:
python -m pip install git+https://github.com/glarue/intronIC
To upgrade to the latest version:
python -m pip install git+https://github.com/glarue/intronIC --upgrade
Using pixi (for development)
Pixi manages all dependencies automatically:
# Install pixi
curl -fsSL https://pixi.sh/install.sh | bash
# Clone and set up
git clone https://github.com/glarue/intronIC.git
cd intronIC
pixi install
pixi run intronIC --help
From source
Clone the repository and install in development mode:
git clone https://github.com/glarue/intronIC.git
cd intronIC
pip install -e .
Verifying Installation
After installing, verify it works with the bundled test data:
# Quick installation test (~1 minute with -p 4)
intronIC test -p 4
# Show where test data is located
intronIC test --show-only
This runs a smoke test to ensure intronIC is working correctly.
Dependencies
intronIC requires Python 3.10+ and the following packages:
- numpy >=1.19.0: Numerical operations
- scipy >=1.5.0: Scientific computing
- scikit-learn >=0.22: SVM classifier
- biogl >=3.0: Bioinformatics utilities
- matplotlib (optional): Plotting
- rich (optional): Progress bars
- pyyaml (optional): Configuration files
All required dependencies are installed automatically by pip.
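To see which of these packages are present in the active environment, a small check with the standard-library `importlib.metadata` module can help. This is a minimal sketch and not part of intronIC: the helper name `installed_versions` is hypothetical, and it only reports versions rather than enforcing the minimums listed above.

```python
import sys
from importlib import metadata

# Distribution names as listed in the dependency section above.
REQUIRED = ["numpy", "scipy", "scikit-learn", "biogl"]
OPTIONAL = ["matplotlib", "rich", "pyyaml"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # intronIC requires Python 3.10 or newer.
    print("Python 3.10+:", sys.version_info >= (3, 10))
    for name, version in installed_versions(REQUIRED + OPTIONAL).items():
        print(f"{name}: {version or 'not installed'}")
```

A value of `None` for an optional package only means the corresponding feature (plotting, progress bars, or YAML configuration) is unavailable.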
intronIC was developed on Linux and has only been minimally tested on macOS and Windows.
Useful arguments
The required arguments for any classification run include a name (-n; see note below), along with either:
- Genome (-g) and annotation/BED (-a, -b) files, or
- Intron sequences file (-q) (see Training data and PWMS for formatting information, which matches the reference sequence format)
By default, intronIC includes non-canonical introns, considers only the longest isoform of each gene, and uses streaming mode for memory efficiency. Helpful arguments may include:
- -p: parallel processes, which can significantly reduce runtime
- -f cds: use only CDS features to identify introns (by default, uses both CDS and exon features)
- --no-nc: exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries
- -i: include introns from multiple isoforms of the same gene (default: longest isoform only)
- --no-streaming: disable streaming mode (uses more memory but avoids temporary storage)
- --config: path to YAML configuration file for advanced settings
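When runs are launched from a script, the options above can be assembled into an argument list before being handed to `subprocess.run`. The following is a minimal sketch that uses only flags documented on this page; the function name `build_command` and its keyword names are hypothetical and not part of intronIC.

```python
import shlex


def build_command(name, genome, annotation, processes=None,
                  cds_only=False, exclude_noncanonical=False,
                  all_isoforms=False):
    """Assemble an intronIC classification command as a list of arguments."""
    cmd = ["intronIC", "-g", genome, "-a", annotation, "-n", name]
    if processes is not None:
        cmd += ["-p", str(processes)]   # parallel processes
    if cds_only:
        cmd += ["-f", "cds"]            # use only CDS features
    if exclude_noncanonical:
        cmd.append("--no-nc")           # drop non-canonical introns
    if all_isoforms:
        cmd.append("-i")                # include introns from all isoforms
    return cmd


if __name__ == "__main__":
    cmd = build_command("homo_sapiens", "genome.fa", "annotation.gff",
                        processes=4, cds_only=True)
    # Print a shell-quoted version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell-quoting problems with file paths that contain spaces.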
Configuration files
intronIC supports YAML configuration files for managing complex runs. Configuration files are searched in this order:
- Path specified by --config
- .intronIC.yaml in current directory
- ~/.config/intronIC/config.yaml
- ~/.intronIC.yaml in home directory
- Built-in defaults
CLI arguments always override config file values. To generate a template configuration file:
intronIC --generate-config > my_config.yaml
Example configuration:
scoring:
  threshold: 90.0
  exclude_noncanonical: false
extraction:
  flank_length: 100
  feature_type: both
performance:
  processes: 8
Use with:
intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species
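The search order and override rule described in this section can be sketched as follows. This illustrates the documented precedence only and is not intronIC's actual implementation: the function names are hypothetical, settings are shown as flat dictionaries for brevity, and the handling of an explicit `--config` path that does not exist (here it simply falls through to the next candidate) is an assumption.

```python
from pathlib import Path


def config_candidates(explicit=None, cwd=None, home=None):
    """Return candidate config paths in the documented search order."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    candidates = []
    if explicit is not None:
        candidates.append(Path(explicit))            # 1. --config
    candidates.append(cwd / ".intronIC.yaml")        # 2. current directory
    candidates.append(home / ".config" / "intronIC" / "config.yaml")  # 3.
    candidates.append(home / ".intronIC.yaml")       # 4. home directory
    return candidates


def find_config(explicit=None, cwd=None, home=None):
    """Return the first candidate that exists, or None to use built-in defaults."""
    for path in config_candidates(explicit, cwd, home):
        if path.is_file():
            return path
    return None


def merge_settings(defaults, file_values, cli_values):
    """Layer settings so that CLI values override file values, which override defaults."""
    merged = dict(defaults)
    merged.update(file_values)
    # Only CLI options that were actually supplied (not None) take effect.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged
```

For example, with `processes: 8` in the config file and `-p 4` on the command line, the merged value of `processes` would be 4.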
Running on test dataset
The easiest way to test intronIC is with the bundled test data:
intronIC test -p 4
This automatically uses the included chromosome 19 test data and verifies your installation.
Manual test with custom data
If you prefer to manually test with specific files:
- If you have installed via pip, the test data is bundled with the package. Use intronIC test --show-only to see the location, or download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.
- If you have cloned the repo, first change to the src/intronIC/data/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. From the repo root, you can run python -m intronIC instead of intronIC in the following examples.
Classify annotated introns
intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
The various output files contain different information about each intron; information can be cross-referenced by using the intron label (usually the first column of the file). By default, U12-type introns are those with adjusted probability scores >90%, or equivalently relative scores >0. For example, here is a U12-type AT-AC intron from the meta.iic file:
HomSap-ENSG00000141837@ENST00000614285_1(47);[c:-1] 10.0 AT-AC GCC|ATATCCTTTT...TTTTCCTTAATT/TTTTTCCTTAAT...AATAC|TCC CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC 50719 ENST00000614285 ENSG00000141837 1 47 0.0 2 u12 cds corrected
To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.
awk '($2!="NA" && $2>0)' homo_sapiens.meta.iic
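The same filter can be written in Python when the results feed into further analysis. This is a minimal sketch, not part of intronIC: the function name `u12_rows` is hypothetical, fields are split on whitespace as awk does by default, and rows whose second column is not numeric are skipped.

```python
def u12_rows(lines):
    """Yield the fields of each row whose relative score (2nd column) is > 0."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[1] == "NA":
            continue  # no score available for this intron
        try:
            score = float(fields[1])
        except ValueError:
            continue  # assumption: ignore rows with a non-numeric score
        if score > 0:
            yield fields


if __name__ == "__main__":
    with open("homo_sapiens.meta.iic") as handle:
        for fields in u12_rows(handle):
            # Print the intron label and its relative score.
            print(fields[0], fields[1])
```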
Extract all annotated intron sequences
If you just want to retrieve all annotated intron sequences (without classification), use the extract subcommand:
intronIC extract -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
See the rest of the Wiki for more extensive details about output files, usage info, etc.
A note on the -n (name) argument
By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.
Output files, on the other hand, are named using the full name supplied via -n. To have intronIC use the argument supplied to -n unmodified, use the --na flag.
If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
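The naming behavior described above can be illustrated with a short sketch. It is an approximation rather than intronIC's actual code: the function name `species_tag` is hypothetical, and the rule of taking the first three letters of each of the first two name parts is inferred from the "HomSap" example.

```python
import re


def species_tag(name, keep_raw=False):
    """Approximate the intron ID tag derived from a -n argument.

    keep_raw mimics the effect of --na (name left unmodified).
    """
    if keep_raw:
        return name
    # Split on runs of non-alphanumeric characters; only the first two
    # parts (genus, species) contribute to the tag.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    if len(parts) < 2:
        return name  # assumption: non-binomial names are left as-is
    genus, species = parts[0], parts[1]
    return genus[:3].capitalize() + species[:3].capitalize()


if __name__ == "__main__":
    for name in ("homo_sapiens", "homo.sapiens", "homo_sapiens.v2"):
        print(name, "->", species_tag(name))
```

Under these assumptions all three names map to the same tag, while the output file names would still differ because they use the full -n argument.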
Resource usage
Streaming vs in-memory
--streaming (the default) writes intron sequences to temporary on-disk storage during extraction and keeps only scoring motifs in memory; --in-memory keeps everything in memory. The two modes produce bit-identical classifications (locked in by integration tests as of v2.4); they differ only in the runtime/memory tradeoff.
To disable streaming (uses more memory but avoids temporary storage):
intronIC -g genome.fa -a annotation.gff -n species --in-memory
Benchmark
Reference run on Homo sapiens GRCh38.p13 + NCBI RefSeq GFF, classifying 257k introns with -p 5 on commodity hardware (single workstation, NVMe SSD), default v2.7 bundle (mode-separation two-pass + continuous discount):
| Species | Mode | Wall time | Peak RSS |
|---|---|---|---|
| Drosophila melanogaster (Release 6 + ISO1_MT, 47k scored introns) | --streaming | ~8 min | ~0.8 GB |
| Homo sapiens (GRCh38.p13, 257k scored introns) | --streaming | ~40 min | ~5.3 GB |
--in-memory was not re-measured for v2.7; based on the v2.4 ratio it is expected to finish at essentially the same wall time on multi-contig genomes with roughly 2× the peak memory. Streaming and in-memory remain bit-identical (covered by integration tests).
Note on v2.6+ runtime. Mode-separation requires two SVM passes per intron (a first-pass cluster-aware classifier produces the candidate weights used to estimate per-species U12/U2 modes; a second-pass mode-separation classifier then re-scores eligible introns). Each ensemble contains 126 models. This roughly doubles SVM compute compared to the single-pass v2.4 default, which is the main reason HomSap wall time grew from ~16 min (v2.4) to ~40 min (v2.7) at similar parallelism.
Scaling
Memory and runtime scale with the number of annotated introns rather than genome size. For non-human genomes:
- Non-model genomes (typical): ~1-5 GB peak, ~3-10 min (-p 8)
- Small test datasets: well under 1 GB, ~1-3 min
These estimates are for classification with the default v2.7 bundle (two 126-model RBF SVM ensembles: first-pass v4_aug_cluster_aware + second-pass v5_modesep_aug). Model training with intronIC train can take significantly longer (minutes to hours) depending on configuration. Using more parallel processes (-p N) reduces runtime in the scoring phase, but extraction is largely I/O-bound and benefits less from additional processes.
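As a rough check on the claim that cost tracks intron count, the two streaming benchmark rows above can be converted to per-intron rates. This is back-of-the-envelope arithmetic on the approximate figures reported in the table (both measured with -p 5), using decimal units (1 GB = 1e6 KB); the function name `per_intron_rates` is hypothetical, and the result should not be read as a performance model.

```python
# Approximate figures copied from the benchmark table (--streaming, -p 5).
BENCHMARKS = {
    "Drosophila melanogaster": {"introns": 47_000, "minutes": 8, "peak_gb": 0.8},
    "Homo sapiens": {"introns": 257_000, "minutes": 40, "peak_gb": 5.3},
}


def per_intron_rates(bench):
    """Return (introns scored per second, peak KB of memory per intron)."""
    introns_per_sec = bench["introns"] / (bench["minutes"] * 60)
    kb_per_intron = bench["peak_gb"] * 1e6 / bench["introns"]
    return introns_per_sec, kb_per_intron


if __name__ == "__main__":
    for species, bench in BENCHMARKS.items():
        rate, kb = per_intron_rates(bench)
        print(f"{species}: ~{rate:.0f} introns/s, ~{kb:.0f} KB peak per intron")
```

Both species come out at roughly 100 introns per second and about 20 KB of peak memory per intron, despite a more than fivefold difference in intron count, which is consistent with approximately linear scaling in the number of introns.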