Data filtering notes - glarue/intronIC GitHub Wiki
-
When scoring introns, `intronIC` only processes introns with unique coordinates, and by default only includes introns from the longest isoform for a given gene. If run with `-i` (to include multiple isoforms), introns with duplicate coordinates are still excluded; in such cases, introns from the longest isoform (computed as the sum of the component coding sequences) will be preferentially included over introns with identical coordinates from shorter isoforms. Duplicate introns may optionally be included in the sequences output file using `-d`.
-
There are a number of criteria by which introns may be omitted from the processed data, depending on run options. These introns will be included in the `bed.iic` and `introns.iic` files (and summarized in `log.iic`), tagged with `[o:x]` in the intron label, where `x` is one of the following:

| Tag | Reason for omission |
|---|---|
| `s` | short: Introns that are shorter than 30 nt (by default) cannot be scored, due to length requirements for the scored sub-sequences. |
| `n` | non-canonical: Introns without terminal dinucleotides in the set [`GT-AG`, `GC-AG`, `AT-AC`] are excluded when run with `--no_nc`. |
| `a` | ambiguous characters: Introns with ambiguous characters (e.g. 'N') in scoring regions cannot be properly scored and are therefore excluded. |
| `i` | short isoform: If run without `-i`, introns not present in the longest isoform are excluded. |
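
The duplicate-coordinate rule and the omission criteria described above can be illustrated with a small sketch. This is not intronIC's actual code: the `Intron` record, the function names, and the order in which the checks run are all assumptions made for illustration. It also simplifies the ambiguity check by scanning the whole intron sequence, whereas the tool only considers the scoring regions.

```python
from typing import List, NamedTuple, Optional

# Canonical terminal dinucleotide pairs listed in the text.
CANONICAL_TERMINI = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


class Intron(NamedTuple):
    """Hypothetical intron record; not intronIC's internal data structure."""
    region: str               # chromosome or contig name
    strand: str
    start: int
    stop: int
    seq: str
    isoform_length: int       # summed coding sequence length of the parent isoform
    in_longest_isoform: bool


def omission_tag(intron: Intron, include_isoforms: bool = False,
                 no_nc: bool = False, min_length: int = 30) -> Optional[str]:
    """Return the 'x' of an [o:x] tag, or None if the intron is kept.

    The check order here is an assumption, not documented behaviour.
    """
    seq = intron.seq.upper()
    if len(seq) < min_length:
        return "s"  # too short to score
    if set(seq) - set("ACGT"):
        return "a"  # ambiguous characters (simplified: whole sequence checked)
    if no_nc and (seq[:2], seq[-2:]) not in CANONICAL_TERMINI:
        return "n"  # non-canonical termini, only excluded with --no_nc
    if not include_isoforms and not intron.in_longest_isoform:
        return "i"  # not in the longest isoform, only excluded without -i
    return None


def drop_duplicates(introns: List[Intron]) -> List[Intron]:
    """Keep one intron per coordinate set, preferring the longest isoform."""
    best = {}
    for intron in introns:
        key = (intron.region, intron.strand, intron.start, intron.stop)
        kept = best.get(key)
        if kept is None or intron.isoform_length > kept.isoform_length:
            best[key] = intron
    return list(best.values())
```

In this sketch, two introns with identical coordinates collapse to the one whose parent isoform has the larger summed coding length, mirroring the preference described for runs with `-i`.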
-
Non-canonical introns with very strong U12-like 5′ motifs near their annotated start will have their start and stop coordinates corrected (by equal amounts) to reflect the more U12-like splicing boundaries. These introns are tagged with `[c:x]`, where `x` is the relative coordinate shift applied (the total number of corrected introns is also summarized in `log.iic`). Furthermore, the features defining such introns (e.g. `CDS` or `exon`) within the annotation will have their coordinates adjusted to reflect the new intron boundaries in the `annotation.iic` output file.
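
As a rough illustration of what an equal shift of both boundaries implies, the sketch below reads a `[c:x]` tag from a label and applies the shift to an intron and its flanking features. The function names and the example label are hypothetical, and the exact format of `x` (assumed here to be an optionally signed integer) is a guess. Coordinates are assumed to be 1-based, inclusive, and on the plus strand; minus-strand handling is not shown.

```python
import re
from typing import Optional, Tuple

Interval = Tuple[int, int]

# Assumed tag layout: "[c:x]" where x is an optionally signed integer.
CORRECTION_TAG = re.compile(r"\[c:([+-]?\d+)\]")


def parse_shift(label: str) -> Optional[int]:
    """Return the shift recorded in a [c:x] tag, or None if the tag is absent."""
    match = CORRECTION_TAG.search(label)
    return int(match.group(1)) if match else None


def shift_intron(start: int, stop: int, shift: int) -> Interval:
    """Move both intron boundaries by the same amount."""
    return start + shift, stop + shift


def shift_flanking_features(upstream: Interval, downstream: Interval,
                            shift: int) -> Tuple[Interval, Interval]:
    """Adjust the feature edges that define the intron so they still abut it.

    Only the upstream feature's end and the downstream feature's start move.
    """
    (up_start, up_end), (down_start, down_end) = upstream, downstream
    return (up_start, up_end + shift), (down_start + shift, down_end)


# Hypothetical example: an intron at 101-200 flanked by features 1-100 and 201-300.
shift = parse_shift("example_intron[c:-2]")
new_intron = shift_intron(101, 200, shift)
new_features = shift_flanking_features((1, 100), (201, 300), shift)
```

Because both boundaries move together, the intron length is unchanged, and the flanking features grow or shrink by the size of the shift.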