Output files - glarue/intronIC GitHub Wiki
Output files
intronIC will automatically generate a set of output files. A brief description of the contents of each file follows (numbered lists represent the columns within each file).
For many use cases, the most important files will be:
- introns.iic: full intron sequences, with U12-type probability score in 2nd column
- bed.iic: BED file of intron coordinates, with U12-type probability score in 5th column
- meta.iic: metadata file with intron info such as parent transcript/gene, ordinal index, phase, position as % of coding sequence, etc.
annotation.iic
If putatively misannotated introns are found, intronIC will correct their coordinates by adjusting the features that define them in this file. These entries will contain a 'shift' tag that indicates the change made to either their start or stop coordinate (or both). This file is only made if misannotated introns are found (otherwise, it would be identical to the original annotation).
demoted.iic
NOTE: this option is disabled by default; see command line arguments for details
Scoring information for putative U12s whose scores fell below the threshold after their boundaries were switched to GT-AG (as a check to avoid scoring non-canonical introns as U12-type by way of superficial similarities to U12-type motifs):
1. label
2. initial score (with 5′ and branch point scores in parentheses), followed by reduced score after boundary switching
dupe_map.iic
A mapping table of unique, scored intron labels and their corresponding duplicate intron labels:
HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26) HomSap-gene-DIAPH2@rna-NM_007309.4-intron_23(26);[o:i];[i]
1. scored intron label
2. duplicate intron label
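As a rough sketch of how this table might be consumed, the Python below groups duplicate labels under the scored intron they map to. It assumes the two columns are separated by whitespace (the delimiter is not stated on this page) and that a scored intron can appear on more than one line. The function name read_dupe_map is hypothetical, and the trailing tags on the duplicate label are kept verbatim because their meaning is not described here.

```python
from collections import defaultdict


def read_dupe_map(lines):
    """Group duplicate intron labels under the scored label they map to."""
    dupes = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: two whitespace-separated columns per line.
        scored, duplicate = line.split(None, 1)
        # The duplicate label is stored as-is, including any trailing tags.
        dupes[scored].append(duplicate)
    return dict(dupes)


# The example line shown above, split over two string literals for width.
example = [
    "HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26) "
    "HomSap-gene-DIAPH2@rna-NM_007309.4-intron_23(26);[o:i];[i]",
]
dupe_map = read_dupe_map(example)
```

In practice the list would be replaced by an open file handle for the dupe_map.iic file.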
introns.iic
All of the annotated intron sequences, including any introns not meeting scoring criteria (e.g. too short, containing non-ATCG characters in scoring regions, etc.). Duplicate introns are also included if run with -d:
HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26) 100.0 TTTCAGCTCAAATTCTCAAGAGCAACCTTGCATCAATGGAACAACAAATTGTTCATCTGGAACGTGACATCAAGAAATTCCCCCAAGCAGAAAATCAACACGATAAGTTTGTGGAAAAGATGACC ATATCCTTTATTTAT[...]TTAACAAAAAGCTAC AGCTTTACAAAGACTGCCCGAGAACAGTATGAAAAACTCTCCACCATGCACAACAACATGATGAAGCTCTATGAGAATCTTGGAGAATACTTCATTTTTGACTCAAAGACAGTGAGCATAGAAGAGTTCTTTGGTGATCTCAACAACTTCCGAACTTTGTTTTTG
1. label (without score tag)
2. score (or '.' if run with -s or otherwise unscored)
3. upstream (5′ exon) sequence (200 nt by default)
4. intron sequence
5. downstream (3′ exon) sequence (200 nt by default)
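The column layout above could be read with something like the following Python sketch. It assumes whitespace-separated columns, which is not confirmed on this page, and maps the '.' placeholder to None so that unscored introns are easy to filter. The names IntronSeq and parse_introns_line, the labels, and the shortened sequences are all made up for illustration.

```python
from typing import NamedTuple, Optional


class IntronSeq(NamedTuple):
    label: str
    score: Optional[float]  # None when the score column holds '.'
    upstream: str
    intron: str
    downstream: str


def parse_introns_line(line):
    """Split one introns.iic line into its five documented columns."""
    # Assumption: columns are separated by whitespace (tabs or spaces).
    label, score, upstream, intron, downstream = line.split()
    parsed_score = None if score == "." else float(score)
    return IntronSeq(label, parsed_score, upstream, intron, downstream)


# Hypothetical records with shortened sequences.
scored = parse_introns_line("example-intron_1(2) 100.0 ACC ATATCCTTTAAC AGC")
unscored = parse_introns_line("example-intron_2(2) . ACC GTAAGTAG AGC")
```

Checking for None before comparing scores avoids treating unscored introns as if they had a probability of zero.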
bed.iic
A BED format file of intron coordinates with U12 probability scores and labels:
NC_000023.11 97247840 97348115 HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26);9.999% 99.99999999997891 +
1. genomic region (e.g. 'chr1')
2. start coordinate (0-indexed)
3. end coordinate (1-indexed)
4. label (including rounded score tag)
5. U12-type probability score (0-100)
6. strand
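A minimal Python sketch for reading these six columns and keeping only introns at or above a chosen probability is shown below. It assumes whitespace-separated fields and that the score tag is whatever follows the last ';' in the label, as in the example line. The function names and the cutoff of 90.0 are illustrative choices for this sketch, not documented intronIC defaults.

```python
def parse_bed_line(line):
    """Split one bed.iic line into a dict keyed by the six columns."""
    # Assumption: fields are separated by whitespace.
    chrom, start, end, label, score, strand = line.split()
    # Assumption: the rounded score tag follows the last ';' in the label.
    name, sep, score_tag = label.rpartition(";")
    if not sep:  # no tag present, keep the label whole
        name, score_tag = label, ""
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "name": name,
        "score_tag": score_tag,
        "score": float(score),
        "strand": strand,
    }


def select_by_score(records, min_score):
    """Keep records whose U12-type probability is at least min_score."""
    return [r for r in records if r["score"] >= min_score]


example = (
    "NC_000023.11 97247840 97348115 "
    "HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26);9.999% "
    "99.99999999997891 +"
)
record = parse_bed_line(example)
likely_u12 = select_by_score([record], 90.0)
```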
log.iic
A log of all of the information generated during operation, including total number of introns processed, excluded, etc., and total number of U12-type introns identified.
pwms.iic
A FASTA file of the PWMs used, including those built from the experimental dataset (not including pseudocount values).
meta.iic
A hodgepodge of other data about each intron (ordered by increasing U12 score):
HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26) 10.0 AT-AC ACC|ATATCCTTTA...TGTTCCTTAACA/ATGTTCCTTAAC...GCTAC|AGC TGATTGATTGCCTTTAAAAGGTACTGTTGAGCCA[TGTTCCTTAACA]AAAAGCTAC 100276 rna-NM_006729.5 gene-DIAPH2 23 26 86.025 0 u12 cds
1. label
2. relative score based upon score threshold (U2 <= 0 < U12)
3. terminal dinucleotides (e.g. GT-AG, AT-AC)
4. motif string (5′, U12/U2 BPS, 3′)
5. BPS location in context of 3′ end
6. length (bp)
7. parent transcript
8. parent gene
9. ordinal position in transcript
10. total introns in transcript
11. fractional position in transcript as a percentage of the coding length, e.g. 50.0 for an intron that interrupts the coding sequence between codons 15 and 16 out of 30
12. phase (0, between codons; 1, after the first base of a codon; 2, after the second base of a codon)
13. binary classification made by the classifier (u12 or u2), which may include introns of various probabilities within each class (i.e. introns labeled "u12" by the classifier may include introns with probabilities significantly lower than the specified threshold)
14. genomic feature used to define the intron (e.g. cds, exon)
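The sketch below shows one way the 14 columns might be loaded in Python, plus the arithmetic behind the position and phase columns. It assumes whitespace-separated fields; the key names in META_COLUMNS and the function names are labels invented for this sketch, not names used by intronIC. Numeric fields that fail to parse are set to None, on the assumption that some introns may carry a placeholder there. The coding_position helper only reproduces the worked example from the list (codons 15 and 16 out of 30) and is not taken from the intronIC source.

```python
META_COLUMNS = (
    "label", "relative_score", "dinucleotides", "motifs", "bps_context",
    "length", "transcript", "gene", "index", "total",
    "percent_position", "phase", "classification", "feature",
)
INT_COLUMNS = ("length", "index", "total", "phase")
FLOAT_COLUMNS = ("relative_score", "percent_position")


def _number(text, cast):
    """Convert text with cast, returning None for non-numeric placeholders."""
    try:
        return cast(text)
    except ValueError:
        return None


def parse_meta_line(line):
    """Map one meta.iic line onto the 14 documented columns."""
    values = line.split()  # assumption: whitespace-separated fields
    if len(values) != len(META_COLUMNS):
        raise ValueError(f"expected {len(META_COLUMNS)} columns, found {len(values)}")
    record = dict(zip(META_COLUMNS, values))
    for key in INT_COLUMNS:
        record[key] = _number(record[key], int)
    for key in FLOAT_COLUMNS:
        record[key] = _number(record[key], float)
    return record


def exceeds_threshold(record):
    """True when the relative score is above 0 (U2 <= 0 < U12)."""
    score = record["relative_score"]
    return score is not None and score > 0


def coding_position(coding_bases_upstream, coding_length):
    """Return (percent of coding length, phase) for an intron position."""
    percent = 100.0 * coding_bases_upstream / coding_length
    phase = coding_bases_upstream % 3  # 0 means the intron falls between codons
    return percent, phase


example = (
    "HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26) 10.0 AT-AC "
    "ACC|ATATCCTTTA...TGTTCCTTAACA/ATGTTCCTTAAC...GCTAC|AGC "
    "TGATTGATTGCCTTTAAAAGGTACTGTTGAGCCA[TGTTCCTTAACA]AAAAGCTAC "
    "100276 rna-NM_006729.5 gene-DIAPH2 23 26 86.025 0 u12 cds"
)
record = parse_meta_line(example)
# Codons 15 and 16 out of 30: 45 coding bases upstream of 90 in total.
halfway = coding_position(45, 90)
```

Note that exceeds_threshold uses the relative score in column 2, which can disagree with the classifier label in column 13 for the reason given in that column's description.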
score_info.iic
Various scoring information (in order of increasing score).
HomSap-gene-DIAPH2@rna-NM_006729.5-intron_23(26) 9.999999999978911 99.99999999997891 14.009663556553722 ACCATATCCTTT 32.68142988959584 5.7780213971590415 TGTTCCTTAACA ATGTTCCTTAAC 12.230872477795534 3.1791686948829034 AGCTACAGCT 15.970632948856013 8.375420832149185
1. intron label
2. relative score (maximum precision); U2 <= 0 < U12
3. SVM-assigned U12 probability score (0-100) (averaged across N SVM classifiers if run with --subsample_n N)
4. 5′ sequence used for scoring
5. 5′ log-ratio score
6. 5′ z-score
7. U12-type branch point sequence used for scoring
8. U2-type branch point sequence used for scoring
9. branch point log-ratio score
10. branch point z-score
11. 3′ sequence used for scoring
12. 3′ log-ratio score
13. 3′ z-score
14. distance from hyperplane, i.e. the raw classifier output prior to scikit-learn's implementation of Platt scaling to convert distances to probabilities
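To illustrate how a hyperplane distance relates to a probability, here is a minimal Python sketch of the sigmoid used in Platt scaling, P = 1 / (1 + exp(A*f + B)), where f is the signed distance. The coefficients A and B are fitted by the library during training and are not reported in this file, so the values below (A = -1.0, B = 0.0) are placeholders chosen only to show the shape of the mapping. They will not reproduce the probabilities intronIC reports.

```python
import math


def platt_probability(distance, a, b):
    """Map a signed hyperplane distance to a probability with a sigmoid.

    a and b are the fitted Platt coefficients; a is normally negative so
    that larger positive distances give probabilities closer to 1.
    """
    return 1.0 / (1.0 + math.exp(a * distance + b))


# Placeholder coefficients, NOT values fitted by intronIC or scikit-learn.
A, B = -1.0, 0.0

on_boundary = platt_probability(0.0, A, B)      # exactly on the hyperplane
far_positive = platt_probability(14.0, A, B)    # well inside the positive class
far_negative = platt_probability(-14.0, A, B)   # well inside the negative class

# Scores in the output files are on a 0-100 scale.
on_boundary_percent = 100.0 * on_boundary
```

With these placeholder coefficients, a distance of 0 maps to a probability of one half, and the probability rises monotonically with distance.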