Example usage - glarue/intronIC GitHub Wiki

Example usage

  • If you installed intronIC via pip, first download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.

  • If you cloned the repo, first change to the /intronIC/intronIC/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. In the examples below, replace intronIC with ../intronIC.py.

To collect and classify all (non-redundant) annotated introns, do the following:

$ intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
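The same invocation can be scripted. The following is a minimal sketch, assuming intronIC is on the PATH and the two sample files are in the current directory. The helper names build_intronic_command and run_intronic are hypothetical and not part of intronIC; only the -g, -a and -n flags shown above are used.

```python
import shlex
import subprocess


def build_intronic_command(genome, annotation, name, executable="intronIC"):
    """Assemble the argument list for the run shown above.

    Hypothetical helper: it only arranges the -g, -a and -n flags
    from the example command and calls nothing in intronIC itself.
    """
    return [executable, "-g", genome, "-a", annotation, "-n", name]


def run_intronic(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


cmd = build_intronic_command(
    "Homo_sapiens.Chr19.Ensembl_91.fa.gz",
    "Homo_sapiens.Chr19.Ensembl_91.gff3.gz",
    "homo_sapiens",
)

# Print the shell-equivalent command line for inspection.
print(shlex.join(cmd))

# Uncomment to execute (requires intronIC and the sample files):
# run_intronic(cmd)
```

If you cloned the repo instead, passing executable="../intronIC.py" mirrors the substitution described above.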

Information about the run is printed to the screen; the same information, plus some additional details, is written to the log.iic file:

[#] Starting run on [homo_sapiens (HomSap)]
[#] Run command: [/home/glarue/Documents/Coding/Python/Research/intronIC/intronIC -g /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens]
[#] Using [cds,exon] features to define introns
[#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
[#] [38681] introns with duplicate coordinates excluded
[#] [8178] introns omitted from scoring based on the following criteria:
[#] * short (<30 nt): 66
[#] * ambiguous nucleotides in scoring regions: 0
[#] * non-canonical boundaries: 0
[#] * overlapping coordinates: 0
[#] * not in longest isoform: 8112
[#] Most common non-canonical splice sites:
[#] * AT-AG (17/328, 5.18%)
[#] * GT-TG (12/328, 3.66%)
[#] * GG-AG (12/328, 3.66%)
[#] * GA-AG (11/328, 3.35%)
[#] * AG-AG (10/328, 3.05%)
[#] [12] ([3] unique, [9] redundant) putatively misannotated U12 introns corrected in [homo_sapiens.annotation.iic]
[#] [12074] introns included in scoring analysis
[#] [11272] introns used to build U2 branch point matrix (5'SS in bottom [95]th percentile)
[#] Scoring introns using the following regions: [five, bp]
[#] Raw scores calculated for [20689] U2 and [387] U12 reference introns
[#] Raw scores calculated for [12074] experimental introns
[#] Non-redundant training sets: [20556] U2, [387] U12
[#] Training SVM using reference data
Starting optimization round 1/5
Starting optimization round 2/5
Starting optimization round 3/5
Starting optimization round 4/5
Starting optimization round 5/5
[#] Range for 'C' after [5] rounds of optimization: [976.5411685881514]-[976.5419176464368]
[#] Set classifier value for 'C': [976.5415431172212]
[#] Training classifier with optimized hyperparameters
[#] Average classifier performance on training data:
	F1	[1.0]
	P-R AUC	[1.0]
[#] Classifier performance details:
              precision    recall  f1-score   support

          U2       1.00      1.00      1.00      4112
         U12       1.00      1.00      1.00        77

    accuracy                           1.00      4189
   macro avg       1.00      1.00      1.00      4189
weighted avg       1.00      1.00      1.00      4189

[#] [1] putative U12 scores were not robust to boundary switching
[#] [10] putative AT-AC U12 introns found.
[#] [31] putative U12 introns found with scores > [90]%
[#] Adding scores to intron sequences file
[#] Generating figures
[#] Run finished in [7.161 minutes]
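The counts reported in the log are consistent with one another: the introns found, minus duplicates, minus those omitted from scoring, equals the number included in the scoring analysis. The sketch below checks that arithmetic against an excerpt of the log above. The parsing is an assumption based only on the line formats shown here, and parse_counts is a hypothetical helper, not part of intronIC.

```python
import re

# Excerpt copied from the log output above.
LOG_EXCERPT = """\
[#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
[#] [38681] introns with duplicate coordinates excluded
[#] [8178] introns omitted from scoring based on the following criteria:
[#] * short (<30 nt): 66
[#] * ambiguous nucleotides in scoring regions: 0
[#] * non-canonical boundaries: 0
[#] * overlapping coordinates: 0
[#] * not in longest isoform: 8112
[#] [12074] introns included in scoring analysis
"""

# Map a phrase found in a log line to a short key (hypothetical names).
PHRASES = {
    "introns found in": "found",
    "duplicate coordinates excluded": "duplicates",
    "omitted from scoring": "omitted",
    "included in scoring analysis": "scored",
}


def parse_counts(log_text):
    """Return (summary counts, per-criterion omission counts)."""
    counts, criteria = {}, {}
    for line in log_text.splitlines():
        # Bulleted criterion lines look like "[#] * <reason>: <n>".
        bullet = re.match(r"\[#\] \* (.+): (\d+)$", line)
        if bullet:
            criteria[bullet.group(1)] = int(bullet.group(2))
            continue
        # Summary lines carry their count as the first "[<digits>]".
        number = re.search(r"\[(\d+)\]", line)
        for phrase, key in PHRASES.items():
            if number and phrase in line:
                counts[key] = int(number.group(1))
    return counts, criteria


counts, criteria = parse_counts(LOG_EXCERPT)
non_redundant = counts["found"] - counts["duplicates"]

print(non_redundant)                      # 58933 - 38681 = 20252
print(non_redundant - counts["omitted"])  # 20252 - 8178 = 12074
print(sum(criteria.values()))             # 66 + 8112 = 8178
```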

If only the intron sequences are needed, scoring can be bypassed with the -s flag, which significantly reduces processing time (in this example, from about 7 minutes to under 30 seconds) and produces only a subset of the output files:

$ intronIC -g /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s
[#] Starting run on [homo_sapiens (HomSap)]
[#] Run command: [/home/glarue/Documents/Coding/Python/Research/intronIC/intronIC -g /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s]
[#] Using [cds,exon] features to define introns
[#] [58933] introns found in [/home/glarue/Documents/Coding/Python/Research/intronIC/test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
[#] [38681] introns with duplicate coordinates excluded
[#] [20252] intron sequences written to [homo_sapiens.introns.iic]
[#] Run finished in [27.41 seconds]
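To put a number on the difference in processing time, the two "Run finished" lines from the example runs can be compared directly. This is a minimal sketch that assumes only the two time units seen in the logs above (minutes and seconds); runtime_seconds is a hypothetical helper, and the timings are from these specific runs, so they will vary by machine.

```python
import re

# Only the units that appear in the example logs are handled.
UNIT_SECONDS = {"seconds": 1.0, "minutes": 60.0}


def runtime_seconds(line):
    """Convert a 'Run finished in [<value> <unit>]' line to seconds."""
    match = re.search(r"Run finished in \[([\d.]+) (\w+)\]", line)
    if match is None:
        raise ValueError(f"not a run-time line: {line!r}")
    value, unit = float(match.group(1)), match.group(2)
    if unit not in UNIT_SECONDS:
        raise ValueError(f"unhandled time unit: {unit!r}")
    return value * UNIT_SECONDS[unit]


full_run = runtime_seconds("[#] Run finished in [7.161 minutes]")
sequences_only = runtime_seconds("[#] Run finished in [27.41 seconds]")

# Ratio of the full scoring run to the -s (sequences-only) run.
speedup = full_run / sequences_only
print(f"{full_run:.2f} s vs {sequences_only:.2f} s ({speedup:.1f}x faster)")
```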

Many additional options exist for a variety of use cases. For details, run intronIC --help or see the Usage info page.