Training data and PWMs - glarue/intronIC GitHub Wiki

Training data and PWMs

Default data files can be found in the data directory. They consist of the following:

[u2, u12]_reference.introns.iic[.gz]

Intron sequences (including any flanking exon sequence required for scoring) used to establish a score threshold. For best results, these should contain representative introns of both types for the species being analyzed, although the default set should work well in many cases.

The default reference sets include introns found to be conserved across one or more groupings of species. In the U2-type set, all introns are conserved between human, mouse and zebrafish, and also between human, macaque and marmoset. The U12-type set contains human U12-type introns conserved in the previous groups, as well as between human, mouse, chicken and hagfish.

In addition, the U12-type set includes conserved human U12-type introns found in the U12DB (Alioto 2007) and SpliceRack (Sheth et al. 2006), as well as those with evidence of retention due to U12 spliceosome knockdown (Madan et al. 2015, Niemelä et al. 2014). The identifier for each intron either follows the naming convention used by intronIC or begins with the U12DB identifier, and each line ends with a tag indicating the source(s) of its supporting evidence. Multiple independent sources of evidence were required for an intron to be included in the U12-type training set. See the files themselves for additional details.

Reference sequence formatting

The required tab-separated columns in the reference sequence files are:

  1. Name (or whitespace)
  2. Upstream flanking sequence (>= the number of upstream bases in the 5' scoring region)
  3. Intron sequence
  4. Downstream flanking sequence (>= the number of downstream bases in the 3' scoring region)

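Some additional bookkeeping information is included in the default reference sequences; if supplying your own sequences, only the above columns are necessary. An example record from the default reference data (a single tab-separated line, wrapped here for display):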

HomSap-ENSG00000174444@ENST00000307961-intron_3(9)      99.715  GGATGTTCTGATGCTGTTTGCTACCATGCATAATAGTCATGGGTTGGTATGTATGATGAAGCAGCTACAACTTTTATCTCTT
CTCAATTTTAGGTCATCAGACTAGTGCTGAGTCTTGGGGTACTGGCAGAGCTGTGGCTCGAATTCCCAGAGTTCGAGGTGGTGGGACTCACCGCTCTGGCCAGGGTGCTTTTGGAAAC        GTATCCTTTGTTTCACTACC
TAAGACTGGTCATCTCTGATGGAATTTAGCGGCCTGGGTCTGGTTTATTGATGATAATACGGTGTAAAAATATTACTTTTTTTTGTCTTGAAGAGAAGGGGCTTCATTTATATGGGGTTATTTTGCTTGCAATGATGTCGTAATTT
GCGTCTTACTCTGTTCTCAGCGACAGTTGCCTGCTGTCAGTAAGCTGGTACAGAAGGTTGACGAAAATTCTTACTGAGCAAGAAATAACCTTGTTGTAATTACTAAAATTTGAGAAATGTGATTCTTGACTGGAAAAATAG     
ATGTGTCGTGGAGGCCGAATGTTTGCACCAACCAAAACCTGGCGCCGTTGGCATCGTAGAGTGAACACAACCCAAAAACGATACGCCATCTGTTCTGCCCTGGCTGCCTCAGCCCTACCAGCACTGGTCATGTCTAAAGGTTTGTA
ATACTTTATATAAGAGAGTTTGATAGAAGAAATAAGACACCTACTATTTGATCA        10.1093/nar/gku391,10.1261/rna.071423.119,HomSap-CalJac-MicMur,HomSap-MusMus-DanRer,
U12DB,human-mouse-chicken-hagfish   6
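For user-supplied reference files, the minimal four-column layout listed above can be read with a few lines of Python. The following is only a sketch under stated assumptions: the names `ReferenceIntron` and `parse_reference_line` are hypothetical (they are not part of intronIC), the minimum flank lengths are passed in by the caller, and the extra bookkeeping fields present in the default files are not handled.

```python
from typing import NamedTuple


class ReferenceIntron(NamedTuple):
    """One record in the minimal four-column layout (hypothetical helper type)."""
    name: str
    upstream: str
    intron: str
    downstream: str


def parse_reference_line(line, min_upstream=0, min_downstream=0):
    """Split one tab-separated line into name, upstream flank, intron, downstream flank.

    min_upstream / min_downstream are assumed to be the number of exonic bases
    needed by the 5' and 3' scoring regions; they are supplied by the caller.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated columns, got {len(fields)}")
    # The name column may be empty or whitespace, so strip() can yield "".
    name, upstream, intron, downstream = (f.strip() for f in fields[:4])
    if len(upstream) < min_upstream:
        raise ValueError(f"{name!r}: upstream flank shorter than {min_upstream} nt")
    if len(downstream) < min_downstream:
        raise ValueError(f"{name!r}: downstream flank shorter than {min_downstream} nt")
    return ReferenceIntron(name, upstream, intron, downstream)


# Usage with a made-up record (not taken from the default data):
record = parse_reference_line("intron_1\tACGTAC\tGTAAGTCCTTAACAG\tTTGACC", 3, 3)
print(record.name, len(record.intron))
```

Checking the flank lengths at parse time surfaces malformed records before any scoring is attempted.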

scoring_matrices.fasta.iic

A set of position-weight matrices (PWMs) representing different intron motifs for the scoring regions (five prime and branch point), e.g.:

>u12_atac_five  start=-20       (n=61)
A       C       G       T
0.262295082     0.2459016393    0.3278688525    0.1639344262
0.2295081967    0.262295082     0.1967213115    0.3114754098
0.2459016393    0.262295082     0.262295082     0.2295081967

The headers of the FASTA records must contain information about the type ("U2", "U12"), sub-type ("GTAG", "ATAC", etc.) and scoring region ("five", "bp", "three"). Order of these tags is not important.

The start= tag signifies the start location of the PWM (relative to the first, 0, or last, -1, base of the intron). For example, if the 5' scoring matrix begins with 5 nt of exonic sequence, the header should be tagged with start=-5. The first line after each header must consist only of the order of the bases in the matrix (see the default file for an example).
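The header and matrix layout described above can be illustrated with a small reader. This is a sketch and not intronIC's own parser: `parse_pwm_records` is a hypothetical name, and it assumes the type, sub-type and region tags are the underscore-separated parts of the first header token, as in the example record.

```python
TYPES = {"u2", "u12"}
REGIONS = {"five", "bp", "three"}


def parse_pwm_records(text):
    """Parse FASTA-style PWM records into a list of dicts (hypothetical helper)."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0]
            # Tag order is not important, so look each tag up by membership.
            tags = name.lower().split("_")
            start = None
            for field in fields[1:]:
                if field.startswith("start="):
                    start = int(field.split("=", 1)[1])
            current = {
                "name": name,
                "type": next((t for t in tags if t in TYPES), None),
                "region": next((t for t in tags if t in REGIONS), None),
                "subtype": next((t for t in tags if t not in TYPES | REGIONS), None),
                "start": start,
                "bases": None,
                "rows": [],
            }
            records.append(current)
        elif current is None:
            raise ValueError("matrix data found before the first header")
        elif current["bases"] is None:
            # The first line after the header gives the column order of the bases.
            current["bases"] = line.split()
        else:
            current["rows"].append([float(x) for x in line.split()])
    return records


EXAMPLE = """\
>u12_atac_five  start=-20       (n=61)
A       C       G       T
0.262295082     0.2459016393    0.3278688525    0.1639344262
0.2295081967    0.262295082     0.1967213115    0.3114754098
0.2459016393    0.262295082     0.262295082     0.2295081967
"""

example_records = parse_pwm_records(EXAMPLE)
first = example_records[0]
print(first["type"], first["subtype"], first["region"], first["start"])
```

Each row holds the frequencies for one position, in the column order given by the bases line, so the row at index i corresponds to position start + i.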

The default scoring PWMs are built from a combination of introns supported by multiple publications and introns with clear conservation as U12-type based on internal analyses. Details of the filtering criteria are documented within the default matrices file.

u2.conserved_empirical_bp_matrix.iic

A matrix derived from branch point sequences in conserved U2 introns (based on data from Pineda and Bradley 2018). Used in cases where the number of experimental introns is too low to construct a robust U2 branch point matrix.

Note: Any or all of the above files can be replaced, either permanently by modifying the existing default files, or on a one-off basis using the --pwms and --r[2, 12] command-line arguments.