Usage info - glarue/intronIC GitHub Wiki

Usage info

To get full usage information for intronIC, run intronIC --help, which prints the following:

usage: intronIC [-h] [-g GENOME] [-a ANNOTATION] [-b BED_FILE] -n SPECIES_NAME
                [-q SEQUENCE_FILE] [-f {cds,exon}] [-s] [--no_nc] [-i] [-v]
                [--pwms {PWM file(s)} [{PWM file(s)} ...]]
                [--reference_u12s {reference U12 intron sequences}]
                [--reference_u2s {reference U2 intron sequences}] [--no_plot]
                [--format_info] [-d] [-u] [--no_abbreviate] [-t 0-100]
                [--no_sequence_output] [--five_score_coords start stop]
                [--three_score_coords start stop]
                [--branch_point_coords start stop]
                [-r {five,bp,three} [{five,bp,three} ...]]
                [--abbreviate_filenames] [--recursive]
                [--n_subsample N_SUBSAMPLE] [--cv_processes CV_PROCESSES]
                [-p PROCESSES] [--matrix_score_info] [-C HYPERPARAMETER_C]
                [--min_intron_len MIN_INTRON_LEN] [--pseudocount PSEUDOCOUNT]
                [--exons_as_flanks]

intronIC (intron Interrogator and Classifier) is a script which collects all
of the annotated introns found in a genome/annotation file pair, and produces
a variety of output files (*.iic) which describe the annotated introns and
(optionally) their similarity to known U12 sequences. Unless custom matrices
are supplied via '--pwms', there MUST exist a matrix file named
'scoring_matrices.fasta.iic' in the 'intronIC_data' subdirectory of the
directory containing intronIC.py. The same data directory must also contain a
pair of reference intron sequence files (see --format_info) named
'[u2, u12]_reference_set.introns.iic'.
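
The data-file requirement described above can be checked before a run. The following is a minimal sketch, not part of intronIC itself: the helper name missing_data_files is hypothetical, and it assumes that '[u2, u12]_reference_set.introns.iic' expands to the two filenames listed in the code.

```python
from pathlib import Path

# Filenames taken from the description above. Expanding the
# '[u2, u12]_reference_set.introns.iic' shorthand into two names
# is an assumption of this sketch.
REQUIRED_DATA_FILES = (
    "scoring_matrices.fasta.iic",
    "u2_reference_set.introns.iic",
    "u12_reference_set.introns.iic",
)


def missing_data_files(script_dir):
    """Return the required data files absent from <script_dir>/intronIC_data.

    script_dir is the directory that contains intronIC.py.
    """
    data_dir = Path(script_dir) / "intronIC_data"
    return [
        name for name in REQUIRED_DATA_FILES
        if not (data_dir / name).is_file()
    ]
```

An empty list from missing_data_files(path_to_script_dir) would suggest that the default data files are in place.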

optional arguments:
  -h, --help            show this help message and exit
  -f {cds,exon}, --feature {cds,exon}
                        Specify feature to use to define introns. By default,
                        intronIC will identify all introns uniquely defined by
                        both CDS and exon features. Under the default mode,
                        introns defined by exon features only will be
                        demarcated by an '[e]' tag (default: None)
  -s, --sequences_only  Bypass the scoring system and simply report the intron
                        sequences present in the annotations (default: False)
  --no_nc               Omit introns with non-canonical terminal dinucleotides
                        from scoring (default: False)
  -i, --allow_multiple_isoforms
                        Include non-duplicate introns from isoforms other than
                        the longest in the scored intron set (including those
                        with alt. splice boundaries, unless -v is also set)
                        (default: False)
  -v, --no_intron_overlap
                        Requires -i (or weird annotations). Exclude any
                        introns with boundaries that overlap other introns
                        from higher-priority transcripts (longer coding
                        length, etc.). This will exclude, for example, introns
                        with alternative 5′/3′ boundaries (default: False)
  --pwms {PWM file(s)} [{PWM file(s)} ...]
                        One or more PWMs to use in place of the defaults. Must
                        follow the formatting described by the --format_info
                        option (default: None)
  --reference_u12s {reference U12 intron sequences}, --r12 {reference U12 intron sequences}
                        introns.iic file with custom reference introns to be
                        used for setting U12 scoring expectation, including
                        flanking regions (default: None)
  --reference_u2s {reference U2 intron sequences}, --r2 {reference U2 intron sequences}
                        introns.iic file with custom reference introns to be
                        used for setting U12 scoring expectation, including
                        flanking regions (default: None)
  --no_plot             Do not output illustrations of intron
                        scores/distributions (plotting requires matplotlib)
                        (default: False)
  --format_info         Print information about the system files required by
                        this script (default: False)
  -d, --include_duplicates
                        Include introns with duplicate coordinates in the
                        intron seqs file (default: False)
  -u, --uninformative_naming
                        Use a simple naming scheme for introns instead of the
                        verbose, metadata-laden default format (default:
                        False)
  --no_abbreviate, --na
                        Use the provided species name in full within the
                        output files (default: False)
  -t 0-100, --threshold 0-100
                        Threshold value of the SVM-calculated probability of
                        being a U12 to determine output statistics (default:
                        90)
  --no_sequence_output, --ns
                        Do not create a file with the full intron sequences of
                        all annotated introns (default: False)
  --five_score_coords start stop, --5c start stop
                        Coordinates describing the 5' sequence to be scored,
                        relative to the 5' splice site (e.g. position 0 is the
                        first base of the intron); half-closed interval
                        [start, stop) (default: (-3, 9))
  --three_score_coords start stop, --3c start stop
                        Coordinates describing the 3' sequence to be scored,
                        relative to the 3' splice site (e.g. position -1 is
                        the last base of the intron); half-closed interval
                        (start, stop] (default: (-10, 4))
  --branch_point_coords start stop, --bpc start stop
                        Coordinates describing the region to search for branch
                        point sequences, relative to the 3' splice site (e.g.
                        position -1 is the last base of the intron); half-
                        closed interval [start, stop). (default: (-55, -5))
  -r {five,bp,three} [{five,bp,three} ...], --scoring_regions {five,bp,three} [{five,bp,three} ...]
                        Intron sequence regions to include in intron score
                        calculations. (default: ('five', 'bp'))
  --abbreviate_filenames, --afn
                        Use abbreviated species name when creating output
                        filenames. (default: False)
  --recursive           Generate new scoring matrices and training data using
                        confident U12s from the first scoring pass. This
                        option may produce better results in species distantly
                        related to the species upon which the training
                        data/matrices are based, though beware accidental
                        training on false positives. Recommended only in cases
                        where clear separation between types is seen with
                        default data. (default: False)
  --n_subsample N_SUBSAMPLE
                        Number of sub-samples to use to generate SVM
                        classifiers; 0 uses the entire training set and should
                        provide the best results; otherwise, higher values
                        will better approximate the entire set at the expense
                        of speed. (default: 0)
  --cv_processes CV_PROCESSES
                        Number of parallel processes to use during cross-
                        validation (default: None)
  -p PROCESSES, --processes PROCESSES
                        Number of parallel processes to use for scoring (and
                        cross-validation, unless --cv_processes is also set)
                        (default: 1)
  --matrix_score_info   Produce additional per-matrix raw score information
                        for each intron (default: False)
  -C HYPERPARAMETER_C, --hyperparameter_C HYPERPARAMETER_C
                        Provide the value for hyperparameter C directly
                        (bypasses optimized parameter search) (default: None)
  --min_intron_len MIN_INTRON_LEN
                        Minimum intron length to consider for scoring
                        (default: 30)
  --pseudocount PSEUDOCOUNT
                        Pseudocount value to add to each matrix value to avoid
                        0-div errors (default: 0.0001)
  --exons_as_flanks     Use entire up/downstream exonic sequence as flank
                        sequence in output (default: False)
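
The half-closed [start, stop) coordinates used by --five_score_coords and --branch_point_coords can be illustrated with ordinary string slicing. This is a sketch of the coordinate convention only, not intronIC's own code: the function names are hypothetical, and the sequences are assumed to be plain strings with the upstream flank supplied separately from the intron.

```python
def five_prime_region(upstream_flank, intron, start=-3, stop=9):
    """Bases in [start, stop) relative to the 5' splice site.

    Position 0 is the first base of the intron, so negative
    positions fall in the upstream flanking sequence.
    """
    if -start > len(upstream_flank):
        raise ValueError("upstream flank is shorter than requested")
    joined = upstream_flank + intron
    offset = len(upstream_flank)  # index of intron position 0
    return joined[offset + start:offset + stop]


def branch_point_window(intron, start=-55, stop=-5):
    """Bases in [start, stop) relative to the 3' splice site.

    Position -1 is the last base of the intron. For introns
    shorter than the window, the window is clipped to the intron.
    """
    n = len(intron)
    lo = max(n + start, 0)
    hi = max(n + stop, 0)
    return intron[lo:hi]
```

With the default values, five_prime_region returns 12 bases (3 from the flank and the first 9 of the intron), and branch_point_window returns at most 50 bases ending 5 bases before the 3' end of the intron.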

required arguments (-g, [-a, -b] | -q):
  -g GENOME, --genome GENOME
                        Genome file in FASTA format (gzip compatible)
                        (default: None)
  -a ANNOTATION, --annotation ANNOTATION
                        Annotation file in gff/gff3/gtf format (gzip
                        compatible) (default: None)
  -b BED_FILE, --bed BED_FILE
                        Supply intron coordinates in BED format (default:
                        None)
  -n SPECIES_NAME, --species_name SPECIES_NAME
                        Binomial species name, used in output file and intron
                        label formatting. It is recommended to include at
                        least the first letter of the species, and the full
                        genus name since intronIC (by default) abbreviates the
                        provided name in its output (e.g. Homo_sapiens -->
                        HomSap) (default: None)
  -q SEQUENCE_FILE, --sequence_file SEQUENCE_FILE
                        Provide intron sequences directly, rather than using a
                        genome/annotation combination. Must follow the
                        introns.iic format (see README for description)
                        (default: None)
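
The heading "required arguments (-g, [-a, -b] | -q)" describes which input combinations are accepted: either -q alone, or -g together with -a or -b, with -n always required. A rough sketch of that logic with argparse follows. It is not intronIC's actual parser: the function names and the prog value are hypothetical, and the precedence used when several inputs are given at once is an assumption.

```python
import argparse


def build_parser():
    """Parser with only the input-related flags from the help text."""
    parser = argparse.ArgumentParser(prog="input_mode_sketch")
    parser.add_argument("-g", "--genome")
    parser.add_argument("-a", "--annotation")
    parser.add_argument("-b", "--bed")
    parser.add_argument("-q", "--sequence_file")
    parser.add_argument("-n", "--species_name", required=True)
    return parser


def input_mode(args):
    """Return which input combination was supplied.

    Assumed precedence for this sketch: -q, then -g with -a,
    then -g with -b.
    """
    if args.sequence_file:
        return "sequence_file"
    if args.genome and args.annotation:
        return "annotation"
    if args.genome and args.bed:
        return "bed"
    raise ValueError("provide -q, or -g together with -a or -b")
```

For example, input_mode(build_parser().parse_args(["-n", "Homo_sapiens", "-g", "genome.fa", "-a", "genes.gff3"])) returns "annotation"; the file names here are placeholders.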