Data filtering notes - glarue/intronIC GitHub Wiki

Data filtering notes

  • When scoring introns, intronIC processes only introns with unique coordinates and, by default, includes only introns from the longest isoform of each gene. When run with -i (to include multiple isoforms), introns with duplicate coordinates are still excluded; among introns with identical coordinates, the one from the longest isoform (isoform length being computed as the sum of the component coding sequences) is included in preference to those from shorter isoforms. Duplicate introns may optionally be included in the sequences output file using -d.

  • Depending on run options, introns may be omitted from the processed data for several reasons. Omitted introns are still included in the bed.iic and introns.iic files (and summarized in log.iic), tagged with [o:x] in the intron label, where x is one of the following:

    • s | short: Introns shorter than the minimum length (30 nt by default) cannot be scored, due to length requirements for the scored sub-sequences.
    • n | non-canonical: Introns without terminal dinucleotides in the set [GT-AG, GC-AG, AT-AC] are excluded when intronIC is run with --no_nc.
    • a | ambiguous characters: Introns with ambiguous characters (e.g. 'N') in scoring regions cannot be properly scored and are therefore excluded.
    • i | short isoform: If run without -i, introns not present in the longest isoform are excluded.
  • Non-canonical introns with very strong U12-like 5′ motifs near their annotated start will have their start and stop coordinates shifted by the same amount to reflect the more U12-like splicing boundaries. These introns are tagged with [c:x], where x is the relative coordinate shift applied (the total number of corrected introns is also summarized in log.iic). In the annotation.iic output file, the features defining such introns (e.g. CDS or exon) will also have their coordinates adjusted to match the new intron boundaries.
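
The omission criteria listed above can be illustrated with a minimal Python sketch. This is not intronIC's own code: the function names (`omission_code`, `omission_tag`), their parameters, and the order in which the checks are applied are all assumptions made for illustration. In particular, the sketch checks the whole intron sequence for ambiguous characters, whereas intronIC only considers the scoring regions.

```python
from typing import Optional

# Terminal dinucleotide pairs treated as canonical: GT-AG, GC-AG, AT-AC.
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def omission_code(
    seq: str,
    in_longest_isoform: bool,
    *,
    min_length: int = 30,        # default minimum scorable length (nt)
    no_nc: bool = False,         # hypothetical stand-in for --no_nc
    all_isoforms: bool = False,  # hypothetical stand-in for -i
) -> Optional[str]:
    """Return 's', 'a', 'n', 'i', or None if the intron would be kept.

    The order of the checks is an assumption, not taken from intronIC.
    """
    seq = seq.upper()
    if len(seq) < min_length:
        return "s"  # too short to score
    # Simplification: the whole sequence is checked, not just scoring regions.
    if set(seq) - set("ACGT"):
        return "a"  # ambiguous characters such as 'N'
    if no_nc and (seq[:2], seq[-2:]) not in CANONICAL_PAIRS:
        return "n"  # non-canonical terminal dinucleotides
    if not all_isoforms and not in_longest_isoform:
        return "i"  # not present in the longest isoform
    return None


def omission_tag(code: Optional[str]) -> str:
    """Format a code as the [o:x] label tag, or '' for kept introns."""
    return f"[o:{code}]" if code else ""
```

For example, under these assumptions a 44 nt GT-AG intron from the longest isoform yields no tag, while the same intron from a shorter isoform yields `[o:i]` unless `all_isoforms=True` is passed.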