Example usage - glarue/intronIC GitHub Wiki
Example usage
- If you have installed via pip, first download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.
- If you have cloned the repo, first change to the /intronIC/intronIC/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. Replace intronIC with ../intronIC.py in the following examples.
To collect and classify all (non-redundant) annotated introns, do the following:
$ intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
Information about the run will be printed to the screen; this same information (plus some additional details) can be found in the log.iic file:
[#] Starting run on [homo_sapiens (HomSap)]
[#] Run command: [/home/glarue/Documents/Coding/Python/Research/intronIC/intronIC -g /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens]
[#] Using [cds,exon] features to define introns
[#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
[#] [38681] introns with duplicate coordinates excluded
[#] [8178] introns omitted from scoring based on the following criteria:
[#] * short (<30 nt): 66
[#] * ambiguous nucleotides in scoring regions: 0
[#] * non-canonical boundaries: 0
[#] * overlapping coordinates: 0
[#] * not in longest isoform: 8112
[#] Most common non-canonical splice sites:
[#] * AT-AG (17/328, 5.18%)
[#] * GT-TG (12/328, 3.66%)
[#] * GG-AG (12/328, 3.66%)
[#] * GA-AG (11/328, 3.35%)
[#] * AG-AG (10/328, 3.05%)
[#] [12] ([3] unique, [9] redundant) putatively misannotated U12 introns corrected in [homo_sapiens.annotation.iic]
[#] [12074] introns included in scoring analysis
[#] [11272] introns used to build U2 branch point matrix (5'SS in bottom [95]th percentile)
[#] Scoring introns using the following regions: [five, bp]
[#] Raw scores calculated for [20689] U2 and [387] U12 reference introns
[#] Raw scores calculated for [12074] experimental introns
[#] Non-redundant training sets: [20556] U2, [387] U12
[#] Training SVM using reference data
Starting optimization round 1/5
Starting optimization round 2/5
Starting optimization round 3/5
Starting optimization round 4/5
Starting optimization round 5/5
[#] Range for 'C' after [5] rounds of optimization: [976.5411685881514]-[976.5419176464368]
[#] Set classifier value for 'C': [976.5415431172212]
[#] Training classifier with optimized hyperparameters
[#] Average classifier performance on training data:
F1 [1.0]
P-R AUC [1.0]
[#] Classifier performance details:
              precision    recall  f1-score   support
          U2       1.00      1.00      1.00      4112
         U12       1.00      1.00      1.00        77
    accuracy                           1.00      4189
   macro avg       1.00      1.00      1.00      4189
weighted avg       1.00      1.00      1.00      4189
[#] [1] putative U12 scores were not robust to boundary switching
[#] [10] putative AT-AC U12 introns found.
[#] [31] putative U12 introns found with scores > [90]%
[#] Adding scores to intron sequences file
[#] Generating figures
[#] Run finished in [7.161 minutes]
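The intron counts in the log above can be cross-checked against each other. The following is a minimal sketch, not part of intronIC: it parses a verbatim excerpt of the log with two regular expressions and recomputes the totals. The helper names `leading_count` and `criterion_count` are hypothetical, and the relationships it checks are only those that happen to hold in this particular log.

```python
import re

# Excerpt of the log shown above, copied verbatim.
LOG = """\
[#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
[#] [38681] introns with duplicate coordinates excluded
[#] [8178] introns omitted from scoring based on the following criteria:
[#] * short (<30 nt): 66
[#] * ambiguous nucleotides in scoring regions: 0
[#] * non-canonical boundaries: 0
[#] * overlapping coordinates: 0
[#] * not in longest isoform: 8112
[#] [12074] introns included in scoring analysis
"""


def leading_count(line):
    """Return the first bracketed integer on a '[#] [N] ...' line, else None."""
    match = re.match(r"\[#\] \[(\d+)\]", line)
    return int(match.group(1)) if match else None


def criterion_count(line):
    """Return (label, count) for a '[#] * label: N' line, else None."""
    match = re.match(r"\[#\] \* (.+): (\d+)$", line)
    return (match.group(1), int(match.group(2))) if match else None


lines = LOG.splitlines()

# The four summary lines appear in this order in the excerpt.
found, duplicates, omitted, scored = (
    n for n in map(leading_count, lines) if n is not None
)
criteria = dict(c for c in map(criterion_count, lines) if c is not None)

unique = found - duplicates
print(f"unique introns: {unique}")
print(f"omitted (sum of criteria): {sum(criteria.values())}")
print(f"scored: {unique - omitted}")
```

In this log the per-criterion counts add up to the omitted total, and removing the duplicate and omitted introns from the number found leaves exactly the number included in the scoring analysis.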
If only the intron sequences are desired, scoring can be bypassed using the -s flag, which will significantly reduce the processing time and produce only a subset of the output files:
$ intronIC -g /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s
[#] Starting run on [homo_sapiens (HomSap)]
[#] Run command: [/home/glarue/Documents/Coding/Python/Research/intronIC/intronIC -g /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s]
[#] Using [cds,exon] features to define introns
[#] [58933] introns found in [/home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
[#] [38681] introns with duplicate coordinates excluded
[#] [20252] intron sequences written to [homo_sapiens.introns.iic]
[#] Run finished in [27.41 seconds]
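To put the time saving in concrete terms, the two "Run finished" lines can be compared directly. This is a small sketch under stated assumptions: the `run_seconds` helper is hypothetical, and it only handles the two units that appear in the logs on this page (minutes and seconds). The timings are from the author's machine and will differ elsewhere.

```python
import re

# Only the units seen in the two logs on this page are handled (assumption).
UNIT_SECONDS = {"seconds": 1.0, "minutes": 60.0}


def run_seconds(line):
    """Convert a '[#] Run finished in [X unit]' log line to seconds."""
    match = re.search(r"Run finished in \[([\d.]+) (\w+)\]", line)
    if match is None:
        raise ValueError(f"not a run-time line: {line!r}")
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit]


full_run = run_seconds("[#] Run finished in [7.161 minutes]")
sequences_only = run_seconds("[#] Run finished in [27.41 seconds]")
speedup = full_run / sequences_only

print(f"full run:       {full_run:.1f} s")
print(f"sequences only: {sequences_only:.1f} s")
print(f"speedup with -s: about {speedup:.1f}x")

# The number of sequences written equals introns found minus duplicates.
found, duplicates, written = 58933, 38681, 20252
print(f"found - duplicates == written: {found - duplicates == written}")
```

On this data set the sequences-only run finishes roughly 15 to 16 times faster than the full scoring run, and it writes one sequence per non-duplicate intron.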
Many additional options exist for a variety of use cases. Run intronIC --help for additional details and/or see the Usage info page.
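If you want to script the two invocations shown on this page, one possible approach is to build the argument list in Python and hand it to `subprocess`. This is a sketch, not an official interface: `build_command` and `run` are hypothetical helpers, only the `-g`, `-a`, `-n` and `-s` options from the examples above are used, and the file names assume you are running from the directory that contains the sample files.

```python
import shutil
import subprocess

# Sample file names from the examples on this page.
GENOME = "Homo_sapiens.Chr19.Ensembl_91.fa.gz"
ANNOTATION = "Homo_sapiens.Chr19.Ensembl_91.gff3.gz"


def build_command(name, sequences_only=False, executable="intronIC"):
    """Build the argument list for one run.

    Pass executable="../intronIC.py" when working from a cloned repo,
    as described at the top of this page.
    """
    command = [executable, "-g", GENOME, "-a", ANNOTATION, "-n", name]
    if sequences_only:
        command.append("-s")  # bypass scoring, write intron sequences only
    return command


def run(command):
    """Run the command if the executable can be found; otherwise report it."""
    if shutil.which(command[0]) is None:
        print(f"{command[0]} not found; would run: {' '.join(command)}")
        return None
    return subprocess.run(command, check=True)


# Show the two command lines without executing them.
print(" ".join(build_command("homo_sapiens")))
print(" ".join(build_command("homo_sapiens", sequences_only=True)))
```

Calling `run(build_command("homo_sapiens", sequences_only=True))` would then start the sequences-only run, or print the command if the executable is not available.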