Quick start - glarue/intronIC GitHub Wiki

Quick start/testing

Installation

Using pip (recommended)

Install the latest stable version from PyPI:

python -m pip install intronIC

Or install the latest version directly from GitHub:

python -m pip install git+https://github.com/glarue/intronIC

To upgrade to the latest version:

python -m pip install git+https://github.com/glarue/intronIC --upgrade

Using pixi (for development)

Pixi manages all dependencies automatically:

# Install pixi
curl -fsSL https://pixi.sh/install.sh | bash

# Clone and set up
git clone https://github.com/glarue/intronIC.git
cd intronIC
pixi install
pixi run intronIC --help

From source

Clone the repository and install in development mode:

git clone https://github.com/glarue/intronIC.git
cd intronIC
python -m pip install -e .

Verifying Installation

After installing, verify it works with the bundled test data:

# Quick installation test (~1 minute with -p 4)
intronIC test -p 4

# Show where test data is located
intronIC test --show-only

This runs a smoke test to ensure intronIC is working correctly.

Dependencies

intronIC requires Python 3.10+ and the following packages:

  • numpy >=1.19.0 — Numerical operations
  • scipy >=1.5.0 — Scientific computing
  • scikit-learn >=0.22 — SVM classifier
  • biogl >=3.0 — Bioinformatics utilities
  • matplotlib (optional) — Plotting
  • rich (optional) — Progress bars
  • pyyaml (optional) — Configuration files

All required dependencies are installed automatically by pip.
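
If you want to confirm that an existing environment already satisfies the minimum versions listed above, a small check along the following lines may help. This is only a sketch: `version_tuple` and `check_requirements` are hypothetical helper names (not part of intronIC), and the version comparison is deliberately simple, so pre-release or otherwise unusual version strings are handled only approximately.

```python
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUM_VERSIONS = {
    "numpy": "1.19.0",
    "scipy": "1.5.0",
    "scikit-learn": "0.22",
    "biogl": "3.0",
}


def version_tuple(version):
    """Turn '1.19.0' into (1, 19, 0), using only the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '1.19' and '1.19.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS, get_version=metadata.version):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for package, minimum in minimums.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems[package] = f"found {installed}, need >={minimum}"
    return problems


if __name__ == "__main__":
    for package, problem in check_requirements().items():
        print(f"{package}: {problem}")
```

An empty result means every required package was found at or above its minimum version.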

intronIC was developed on Linux and has only been minimally tested on macOS and Windows.

Useful arguments

Any classification run requires a name (-n; see note below), along with either:

  1. Genome (-g) and annotation or BED (-a or -b) files, or
  2. An intron sequences file (-q); see Training data and PWMS for formatting information (the expected format matches the reference sequence format)

By default, intronIC includes non-canonical introns, considers only the longest isoform of each gene, and uses streaming mode for memory efficiency. Helpful arguments may include:

  • -p parallel processes, which can significantly reduce runtime

  • -f cds use only CDS features to identify introns (by default, uses both CDS and exon features)

  • --no-nc exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries

  • -i include introns from multiple isoforms of the same gene (default: longest isoform only)

  • --no-streaming disable streaming mode (uses more memory but avoids temporary storage)

  • --config path to YAML configuration file for advanced settings

Configuration files

intronIC supports YAML configuration files for managing complex runs. Configuration files are searched in this order:

  1. Path specified by --config
  2. .intronIC.yaml in current directory
  3. ~/.config/intronIC/config.yaml
  4. ~/.intronIC.yaml in home directory
  5. Built-in defaults
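
The search order above can be pictured as a first-match lookup. The following is a minimal sketch of that idea, not intronIC's actual implementation: `find_config` is a hypothetical function, and it assumes that a missing --config path simply falls through to the next candidate, which the real tool may well treat as an error instead.

```python
from pathlib import Path


def find_config(explicit=None, cwd=None, home=None):
    """Return the first existing config path, or None to signal built-in defaults."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()

    candidates = []
    if explicit is not None:
        candidates.append(Path(explicit))                              # 1. --config
    candidates.append(cwd / ".intronIC.yaml")                          # 2. current directory
    candidates.append(home / ".config" / "intronIC" / "config.yaml")   # 3. user config directory
    candidates.append(home / ".intronIC.yaml")                         # 4. home directory

    for path in candidates:
        if path.is_file():
            return path
    return None                                                        # 5. built-in defaults
```

The `cwd` and `home` parameters exist only so the lookup can be exercised against temporary directories; called without arguments, the function inspects the real working and home directories.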

CLI arguments always override config file values. To generate a template configuration file:

intronIC --generate-config > my_config.yaml

Example configuration:

scoring:
  threshold: 90.0
  exclude_noncanonical: false

extraction:
  flank_length: 100
  feature_type: both

performance:
  processes: 8

Use with:

intronIC --config my_config.yaml -g genome.fa -a annotation.gff -n species
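
The rule that CLI arguments override config file values, which in turn override built-in defaults, amounts to a layered merge. The sketch below illustrates that precedence with plain dictionaries. The `layered_merge` helper, the default values, and the mapping of `-p 4` onto `performance.processes` are all assumptions made for illustration; only the config file values are taken from the example above.

```python
import copy


def layered_merge(*layers):
    """Merge nested dicts left to right; later layers win on conflicts."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = layered_merge(merged[key], value)
            else:
                merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical built-in defaults, for illustration only.
defaults = {"scoring": {"threshold": 90.0}, "performance": {"processes": 1}}

# Values as they appear in the example configuration file.
from_file = {
    "scoring": {"threshold": 90.0, "exclude_noncanonical": False},
    "extraction": {"flank_length": 100, "feature_type": "both"},
    "performance": {"processes": 8},
}

# Hypothetical representation of `-p 4` given on the command line.
from_cli = {"performance": {"processes": 4}}

settings = layered_merge(defaults, from_file, from_cli)
print(settings["performance"]["processes"])  # the CLI layer is applied last
```

Settings that appear in only one layer (for example `extraction.flank_length`) pass through unchanged, so a config file only needs to list the values that differ from the defaults.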

Running on test dataset

The easiest way to test intronIC is with the bundled test data:

intronIC test -p 4

This automatically uses the included chromosome 19 test data and verifies your installation.

Manual test with custom data

If you prefer to manually test with specific files:

  • If you have installed via pip, the test data is bundled with the package. Use intronIC test --show-only to see the location, or download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.

  • If you have cloned the repo, first change to the src/intronIC/data/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. From the repo root, you can run python -m intronIC instead of intronIC in the following examples.

Classify annotated introns

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

The various output files contain different information about each intron; information can be cross-referenced using the intron label (usually the first column of the file). By default, U12-type introns are those with adjusted probability scores >90% or, equivalently, relative scores >0. For example, here is a U12-type AT-AC intron from the meta.iic file:

HomSap-ENSG00000141837@ENST00000614285_1(47);[c:-1]     10.0    AT-AC   GCC|ATATCCTTTT...TTTTCCTTAATT/TTTTTCCTTAAT...AATAC|TCC  CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC     50719   ENST00000614285 ENSG00000141837 1       47      0.0     2       u12     cds     corrected

To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.

awk '($2!="NA" && $2>0)' homo_sapiens.meta.iic
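
For downstream scripting, the same filter can be written in Python. This is a rough equivalent of the awk command above rather than an official parser: `u12_lines` is a hypothetical helper, it relies only on the relative score being the second whitespace-separated column, and it skips rows whose score is "NA" or otherwise non-numeric.

```python
def u12_lines(lines, threshold=0.0):
    """Yield lines whose relative score (2nd column) is numeric and above threshold."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[1] == "NA":
            continue
        try:
            score = float(fields[1])
        except ValueError:
            continue  # non-numeric score; ignore the row
        if score > threshold:
            yield line.rstrip("\n")


# Usage (assumes the output file from the run above is in the current directory):
# with open("homo_sapiens.meta.iic") as handle:
#     for line in u12_lines(handle):
#         print(line)
```

Because the function accepts any iterable of lines, it works equally well on an open file handle or on a list of strings.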

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences (without classification), use the extract subcommand:

intronIC extract -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

See the rest of the Wiki for more extensive details about output files, usage info, etc.

A note on the -n (name) argument

By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.

Output files, on the other hand, are named using the full name supplied via -n. To have intronIC leave the argument supplied to -n unmodified, use the --na flag.

If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
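
The naming behavior described in this note can be approximated as follows. This is a sketch inferred from the single documented example ('homo_sapiens' becoming "HomSap"): the three-letter, capitalized abbreviation of each part is an assumption, `species_tag` is a hypothetical function, and names that are not in binomial form are not covered.

```python
import re


def species_tag(name):
    """Abbreviate a binomial name, e.g. 'homo_sapiens.v2' -> 'HomSap'."""
    # Split on runs of non-alphanumeric characters and keep the first two parts,
    # mirroring "ignoring anything past the second non-alphanumeric character".
    parts = [part for part in re.split(r"[^A-Za-z0-9]+", name) if part]
    genus_species = parts[:2]
    # Assumed rule: first three letters of each part, capitalized.
    return "".join(part[:3].capitalize() for part in genus_species)


for name in ("homo_sapiens", "homo.sapiens", "homo_sapiens.v2"):
    print(name, "->", species_tag(name))
```

All three names map to the same tag, which is why appending a suffix such as ".v2" changes the output file names but not the intron labels.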

Resource usage

Streaming vs in-memory

--streaming (the default) writes intron sequences to temporary on-disk storage during extraction and keeps only scoring motifs in memory; --in-memory keeps everything in memory. The two modes produce bit-identical classifications (locked in by integration tests as of v2.4); they differ only in the runtime/memory tradeoff.

To disable streaming (uses more memory but avoids temporary storage):

intronIC -g genome.fa -a annotation.gff -n species --in-memory

Benchmark

Reference runs on commodity hardware (single workstation, NVMe SSD) with the default v2.7 bundle (mode-separation two-pass + continuous discount). The Homo sapiens run used GRCh38.p13 with the NCBI RefSeq GFF and classified 257k introns with -p 5:

Species                                                           | Mode        | Wall time | Peak RSS
Drosophila melanogaster (Release 6 + ISO1_MT, 47k scored introns) | --streaming | ~8 min    | ~0.8 GB
Homo sapiens (GRCh38.p13, 257k scored introns)                    | --streaming | ~40 min   | ~5.3 GB

--in-memory was not re-measured for v2.7; based on the v2.4 ratio it is expected to finish at essentially the same wall time on multi-contig genomes with roughly 2× the peak memory. Streaming and in-memory remain bit-identical (covered by integration tests).

Note on v2.6+ runtime. Mode-separation requires two SVM passes per intron: a first-pass cluster-aware classifier produces the candidate weights used to estimate per-species U12/U2 modes, and a second-pass mode-separation classifier then re-scores eligible introns. Each ensemble contains 126 models. This roughly doubles SVM compute compared to the single-pass v2.4 default, which is the main reason the Homo sapiens wall time grew from ~16 min (v2.4) to ~40 min (v2.7) at similar parallelism.

Scaling

Memory and runtime scale with the number of annotated introns rather than genome size. For non-human genomes:

  • Non-model genomes (typical): ~1-5 GB peak, ~3-10 min (-p 8)
  • Small test datasets: well under 1 GB, ~1-3 min

These estimates are for classification with the default v2.7 bundle (two 126-model RBF SVM ensembles: first-pass v4_aug_cluster_aware + second-pass v5_modesep_aug). Model training with intronIC train can take significantly longer (minutes to hours) depending on configuration. Using more parallel processes (-p N) reduces runtime in the scoring phase, but extraction is largely I/O-bound and benefits less.
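
As a back-of-the-envelope check on the claim that cost tracks intron count, the two benchmark rows can be reduced to per-intron figures. This is plain arithmetic on the approximate numbers reported in the Benchmark section, not a predictive model: it uses only two data points, treats 1 GB as 10^6 kB, and `per_intron_costs` is a hypothetical helper.

```python
# Benchmark rows from the Benchmark section: (scored introns, wall minutes, peak GB).
BENCHMARKS = {
    "Drosophila melanogaster": (47_000, 8.0, 0.8),
    "Homo sapiens": (257_000, 40.0, 5.3),
}


def per_intron_costs(introns, minutes, peak_gb):
    """Return (milliseconds per intron, kilobytes of peak RSS per intron)."""
    ms_per_intron = minutes * 60_000 / introns
    kb_per_intron = peak_gb * 1_000_000 / introns
    return ms_per_intron, kb_per_intron


for species, row in BENCHMARKS.items():
    ms, kb = per_intron_costs(*row)
    print(f"{species}: ~{ms:.1f} ms and ~{kb:.0f} kB per scored intron")
```

Both genomes come out at roughly 10 ms and 20 kB per scored intron despite very different genome sizes, which is consistent with scaling by intron count. Actual figures will vary with hardware, the -p setting, and annotation structure.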