Usage info - glarue/intronIC GitHub Wiki

Usage info

This page contains the complete CLI reference for intronIC. For practical examples, see Example usage.

To get full usage information for intronIC, run intronIC --help:

usage: intronIC [-h] [--version] [--quiet] [--debug] [--config CONFIG_PATH]
                [--generate-config] [-n SPECIES_NAME] [-o OUTPUT_DIR]
                [-g GENOME] [-a ANNOTATION] [-b BED] [-q SEQUENCE_FILE]
                [--model MODEL] [--pretrained-model PRETRAINED_MODEL]
                [--normalizer-mode {human,adaptive,auto}]
                [--species-prior PRIOR] [--load-normalizer PATH]
                [--save-normalizer]
                [-f {cds,exon,both}] [--min-intron-len MIN_INTRON_LEN] [-i]
                [-v] [-d] [--flank-len FLANK_LEN] [--no-nc-ss-adjustment]
                [-t THRESHOLD] [--no-nc] [--pseudocount PSEUDOCOUNT]
                [--no-ignore-nc-dnts] [--five-score-coords START END]
                [--bp-region-coords START END]
                [--three-score-coords START END]
                [-p PROCESSES] [--cv-processes CV_PROCESSES] [--streaming]
                [--clean-names] [--no-clean-names] [-u] [--no-abbreviate]
                [--abbreviate-filenames] [--no-headers] [--seed SEED]
                {train,classify,extract,test} ...

intronIC: Intron classification and extraction tool

positional arguments:
  {train,classify,extract,test}
                        Command to run (default: classify if not specified)
    train               Train a classifier on reference data only (no
                        genome/annotation needed)
    classify            Extract and classify introns from genome/annotation
    extract             Extract intron sequences without classification
    test                Run installation test with bundled test data

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --quiet               Suppress non-essential output
  --debug               Enable debug logging
  --config CONFIG_PATH  Path to configuration file (auto-loads from standard
                        paths if not specified)
  --generate-config     Generate configuration file template and exit
  -n SPECIES_NAME, --species-name SPECIES_NAME, --species_name SPECIES_NAME
                        Species name for output files (e.g., homo_sapiens)
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory (default: current directory)
  --normalizer-mode {human,adaptive,auto}, --normalizer_mode {human,adaptive,auto}
                        Normalizer mode for pretrained model classification
                        (default: auto):
                          human: Use scaler from training species
                            (recommended for U12-absent genomes)
                          adaptive: Refit scaler on experimental data
                            (experimental, may cause false positives in
                            U12-free species)
                          auto: Use human if available in model, otherwise
                            adaptive
  --species-prior PRIOR
                        Expected U12 prior for target species (0 to 1).
                        Adjusts classification probabilities via Bayes rule to
                        account for different U12 base rates. Recommended
                        values:
                          - 0.005: Human-like species (default if not
                            specified)
                          - 1e-6: U12-absent species (C. elegans, many fungi)
                          - 1e-4: U12-poor species
                        Lower values reduce false positives in U12-free
                        lineages.
  --load-normalizer PATH
                        Load a saved normalizer (from a previous run, or any
                        compatible scaler pickle) and use it instead of
                        fitting one. Honored in both streaming and in-memory
                        modes. When passed, this overrides whatever scaler
                        the model bundle ships with and skips the per-input
                        adaptive fit. Typical use: pass --save-normalizer on
                        a first full-genome run, then --load-normalizer
                        <path>.normalizer.pkl on subsequent runs over subsets
                        of the same genome to keep z-scores consistent.
  --save-normalizer     Save the fitted normalizer for future runs (adaptive
                        mode only). Use this on your first full-genome run for
                        a species to establish a reference normalization.
                        Future runs can use --load-normalizer to reuse this
                        normalization. Saved to <output_prefix>.normalizer.pkl

input selection:
  Choose one mode: (1) -g + -a for annotation, (2) -g + -b for BED, or (3) -q for sequences

  -g GENOME, --genome GENOME
                        Path to genome FASTA file (required with -a or -b)
  -a ANNOTATION, --annotation ANNOTATION
                        Path to GFF3/GTF annotation file (requires -g)
  -b BED, --bed BED     Path to BED file with intron coordinates (requires -g)
  -q SEQUENCE_FILE, --sequence-file SEQUENCE_FILE, --sequence_file SEQUENCE_FILE
                        Path to pre-extracted intron sequences (.iic format)

model options:
  --model MODEL         Path to pretrained model (.model.pkl)
  --pretrained-model PRETRAINED_MODEL, --pretrained_model PRETRAINED_MODEL
                        (Deprecated: use --model) Path to pretrained model

extraction parameters:
  -f {cds,exon,both}, --feature {cds,exon,both}
                        Feature type to extract from (default: both)
  --min-intron-len MIN_INTRON_LEN, --min_intron_len MIN_INTRON_LEN
                        Minimum intron length (default: 30)
  -i, --allow-multiple-isoforms, --allow_multiple_isoforms
                        Include non-longest isoforms
  -v, --no-intron-overlap, --no_intron_overlap
                        Exclude overlapping introns
  -d, --include-duplicates, --include_duplicates
                        Include duplicate coordinate introns
  --flank-len FLANK_LEN, --flank_len FLANK_LEN
                        Exonic flank length (default: 100)
  --no-nc-ss-adjustment, --no_nc_ss_adjustment
                        Disable U12 boundary correction

scoring parameters:
  -t THRESHOLD, --threshold THRESHOLD
                        U12 probability threshold 0-100 (default: 90)
  --no-nc, --no_nc      Exclude non-canonical introns from scoring
  --pseudocount PSEUDOCOUNT
                        PWM pseudocount (default: 0.0001)
  --no-ignore-nc-dnts, --no_ignore_nc_dnts
                        Include terminal dinucleotides in non-canonical
                        scoring
  --five-score-coords START END, --five_score_coords START END
                        5' splice site region (default: -3 9)
  --bp-region-coords START END, --bp_region_coords START END
                        Branch point region (default: -55 -5)
  --three-score-coords START END, --three_score_coords START END
                        3' splice site region (default: -6 4)

performance options:
  -p PROCESSES, --processes PROCESSES
                        Parallel processes for scoring (default: 1)
  --cv-processes CV_PROCESSES, --cv_processes CV_PROCESSES
                        Processes for cross-validation (training only, default: same as -p)
  --in-memory, --no-streaming
                        Use in-memory mode: load full genome into memory. Higher memory
                        usage but may be slightly faster for very small genomes.
  --streaming           Use streaming mode (default): stores sequences in temp on-disk
                        storage, keeps only scoring motifs in memory. Roughly halves
                        peak RSS vs --in-memory (e.g., ~5 GB vs ~10 GB for full human at
                        -p 6). Bit-identical to in-memory.

output options:
  --clean-names, --clean_names
                        Remove "transcript:" and "gene:" prefixes (default:
                        True)
  --no-clean-names, --no_clean_names
                        Keep ID prefixes
  -u, --uninformative-naming, --uninformative_naming
                        Use simple naming scheme
  --no-abbreviate, --no_abbreviate, --na
                        Use full species name in outputs
  --abbreviate-filenames, --abbreviate_filenames, --afn
                        Abbreviate species name in filenames
  --no-headers, --no_headers
                        Omit column headers from output files (default:
                        include headers)

advanced options:
  --seed SEED           Random seed (default: 42)

Examples:
  # Test installation with bundled data
  intronIC test -p 4

  # Train a model on reference data (no genome needed!)
  intronIC train -n homo_sapiens

  # Classify introns with pretrained model (uses streaming mode by default)
  intronIC classify -g genome.fa -a annotation.gff -n species --model species.model.pkl

  # Extract intron sequences without classification
  intronIC extract -g genome.fa -a annotation.gff -n species

  # Parallel processing for faster analysis (streaming mode scales efficiently)
  intronIC classify -g genome.fa -a annotation.gff -n species -p 8

  # Use in-memory mode for very small genomes
  intronIC classify -g genome.fa -a annotation.gff -n species --no-streaming

  # Backward compatible (no subcommand = classify)
  intronIC -g genome.fa -a annotation.gff -n species --model species.model.pkl
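
The --species-prior help text says probabilities are adjusted "via Bayes rule" for a different U12 base rate. The sketch below shows the textbook prior-shift correction as one way to picture that adjustment. It is not taken from the intronIC source: the function name adjust_for_prior is hypothetical, and treating 0.005 as the prior the model was trained under is an assumption based on the "human-like" default listed above.

```python
def adjust_for_prior(p_u12: float, target_prior: float,
                     train_prior: float = 0.005) -> float:
    """Re-weight a U12 probability for a species with a different U12 base rate.

    Textbook prior-shift correction (assumed here, not confirmed as intronIC's
    exact formula): scale each class by the ratio of its new prior to its old
    prior, then renormalize so the two classes sum to 1.
    """
    if not 0.0 <= p_u12 <= 1.0:
        raise ValueError("p_u12 must be a probability between 0 and 1")
    if not (0.0 < target_prior < 1.0 and 0.0 < train_prior < 1.0):
        raise ValueError("priors must lie strictly between 0 and 1")

    # Unnormalized weight of each class under the target-species prior.
    u12_weight = p_u12 * target_prior / train_prior
    u2_weight = (1.0 - p_u12) * (1.0 - target_prior) / (1.0 - train_prior)
    return u12_weight / (u12_weight + u2_weight)


# Same intron, three of the priors recommended in the help text.
human_like = adjust_for_prior(0.95, target_prior=0.005)
u12_poor = adjust_for_prior(0.95, target_prior=1e-4)
u12_absent = adjust_for_prior(0.95, target_prior=1e-6)
```

Under these assumptions, an intron scored at 0.95 keeps that probability when the target prior equals the training prior, and drops to well under 1% with a prior of 1e-6. That puts it far below the default threshold of 90, which is the false-positive reduction the help text describes for U12-free lineages.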
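
The --save-normalizer and --load-normalizer descriptions recommend fitting the normalizer once on a full genome and reusing it for subsets "to keep z-scores consistent". The following sketch illustrates why that matters, using a plain mean and standard deviation scaler. ZScoreNormalizer and the raw scores are hypothetical illustrations; the real normalizer is whatever intronIC stores in its .normalizer.pkl file.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class ZScoreNormalizer:
    """Hypothetical stand-in for a fitted scaler: a mean and a standard deviation."""
    mean: float
    std: float

    @classmethod
    def fit(cls, raw_scores):
        std = pstdev(raw_scores)
        if std == 0:
            raise ValueError("cannot fit a normalizer on constant scores")
        return cls(mean=fmean(raw_scores), std=std)

    def transform(self, raw_scores):
        return [(score - self.mean) / self.std for score in raw_scores]


# Made-up raw scores for a whole genome, and a subset of the same introns.
full_genome = [-2.0, -1.0, 0.0, 1.0, 2.0, 12.0]
subset = full_genome[3:]

# Fit once on the full genome (analogous to a --save-normalizer run) ...
reference = ZScoreNormalizer.fit(full_genome)
# ... then reuse that fit on the subset (analogous to --load-normalizer).
reused = reference.transform(subset)
# Refitting on the subset alone (a per-input adaptive fit) shifts the z-scores.
refit = ZScoreNormalizer.fit(subset).transform(subset)
```

With the reused normalizer, each intron in the subset gets exactly the z-score it had in the full-genome run. Refitting on the subset changes both the mean and the spread, so the same intron lands at a different z-score and may end up on the other side of a threshold.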