Genomic data

Genomic data

Working with genomic data can be tedious and error prone. Genomic data can be organized as intervals [s,e) with integer coordinates s and e. Moreover, for many types of data, lists of such intervals are more appropriate. A genomic location is a chromosome, a strand and a list of genomic intervals. These are often associated with a data object (such as per-base numerical values, transcripts, sequencing reads, transcription factor binding site, ...).

Difficulties in working with genomic data mainly arise due to different coordinate systems used (e.g. zero- or one-based, along a transcript or on the genome), different genome assembly versions and because plus and minus strand have to be treated differently (due to the 5' to 3' directionality).

Here we introduce Gedi's functionality to do most of the hard work. Lists of intervals are represented by the interface GenomicRegion (and the basic implementing class ArrayGenomicRegion), providing many useful methods to compute with such entities (intersection, union, subtraction, extension, mapping between coordinate systems, ...).

Genomic data is represented by the interface ReferenceGenomicRegion (and their implementing classes MutableReferenceGenomicRegion and ImmutableReferenceGenomicRegion). These bundle a ReferenceSequence (which is a tuple of chromosome and strand information), a GenomicRegion and a data object and provide methods to do GenomicRegion operations consistent with 5' to 3' direction on both strands.

Pitfalls

Using the wrong genome assembly version

When working with genomic data, it is essential to be consistent with respect to the genome assembly. It is too easy to get confused with different versions as provided by the popular UCSC (University of California in Santa Cruz) genome browser or Ensembl, and when genomic data is loaded that has been created with different genome assemblies, you are in hell. As a solution, Gedi provides a single entry point for genomic data (the class Genomic), which has its own documentation.

Getting confused with the coordinate system

There are two popular ways how genomic intervals are defined: Either zero-based with start position inclusive and end position exclusive (as in bed files), or one-based with start and end position inclusive (as in gtf files). Internally, Gedi uses the former system, as this makes most operations on intervals easier, and all readers take care of converting these coordinates. Specifically, we define the following

Start: The zero-based inclusive start position, i.e. if an interval starts at the very first position, and includes this position, its start is 0.
End: The zero-based exclusive end position, i.e. if an interval has length 10 and starts at s, its end is s+10.
Stop: Is End-1 (i.e. the zero-based inclusive end position, which is the last position included in an interval).

In addition to these definitions, there is the genomic coordinate system, but also other coordinate systems e.g. refering to a position within an mRNA (which directly corresponds to a position on the genomic coordinate system).

The difficulties of regions on the - strand

As genomic regions (i.e. lists of intervals) are by themselves not specific for a strand of a chromosome, they always refer to the 5' to 3' direction of the plus strand. Naturally, if one wants to identify the transcription start sites of transcripts, one has to take the start of regions on the plus strand, and the stop of regions on the minus strand. In addition, when mapping positions within an mRNA to their corresponding genomic positions, one has to take care of the strand directionality (and the same is true when mRNA sequences must be extracted from the genome sequence).

Gedi takes care of all these difficulties by providing a small set of elementary operations that take care of handling these difficulties in a defined way, and that can be combined to achieve more complex goals in a straight-forward manner.

GenomicRegion

As GenomicRegion is supposed to be immutable (you can of course change that with a new implementing class), all operations return a new object.

Creating

There are two basic ways to create new GenomicRegion objects:

GenomicRegion reg1 = GenomicRegion.parse("10-20|35-60");
GenomicRegion reg2 = new ArrayGenomicRegion(25,50,55,65);

These are the resulting genomic locations, as visualized with StringUtils.display():

|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
                         #########################-----##########

Retrieving information from a genomic region

reg1.getStart()       // 10
reg1.getEnd()         // 60
reg1.getStop()        // 59
reg1.getTotalLength() // 35

reg1.getNumParts()    // 2
reg1.getStart(1)      // 35
reg1.getEnd(1)        // 60
reg1.getStop(1)       // 59
reg1.getLength(1)     // 25
reg1.getIntronLength(0)  // 15

Retrieving intervals

reg1.getPart(1)                     // the Interval 35-60
reg1.getParts()                     // 2
reg1.iterator()                     // an ExtendedIterator over all parts (i.e. Interval objects)
reg1.getEnclosingPart(40)           // the Interval 35-60
reg1.getEnclosingPart(30)           // null
reg1.getEnclosingPartIndex(40)      // 1
reg1.getEnclosingPartIndex(30)      // -2 (-1 is before the region)
reg1.binarySearch(35)               // 2 (just like Arrays.binarySearch)
reg1.binarySearch(30)               // -3
reg1.binarySearch(40)               // -4

Manipulating a genomic region

In this section, the first genomic region always is reg1, the second the result of the manipulation.

StringUtils.display(reg1,reg1.removeIntrons())
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
          ##################################################

StringUtils.display(reg1,reg1.extendFront(5))
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
     ###############---------------#########################

StringUtils.display(reg1,reg1.extendBack(5))
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
          ##########---------------##############################

StringUtils.display(reg1,reg1.translate(-3))
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
       ##########---------------#########################

StringUtils.display(reg1,reg1.extendAll(2, 5))
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
        #################--------################################

StringUtils.display(reg1,reg1.invert()
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
                    ###############

StringUtils.display(reg1,reg1.invert(5,63))
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
     #####----------###############-------------------------###

StringUtils.display(reg1,reg1.reverse(65))
|0        |10       |20       |30       |40       |50       |60
          ##########---------------#########################
     #########################---------------##########

Computing with two genomic regions

In this section, the first two genomic region always are reg1 and reg2, the third the result of the manipulation.

StringUtils.display(reg1,reg2, reg1.union(reg2))
|0        |10       |20       |30       |40       |50       |60       
          ##########---------------#########################
                         #########################-----##########
          ##########-----########################################

StringUtils.display(reg1,reg2, reg1.intersect(reg2))
|0        |10       |20       |30       |40       |50       |60       
          ##########---------------#########################
                         #########################-----##########
                                   ###############-----#####

StringUtils.display(reg1,reg2, reg1.subtract(reg2))
|0        |10       |20       |30       |40       |50       |60       
          ##########---------------#########################
                         #########################-----##########
          ##########------------------------------#####

Testing positional properties of genomic regions

Here we need a few additional regions:

|0        |10       |20       |30       |40       |50       |60       
          ##########---------------#########################          reg1
                         #########################-----##########     reg2
                                        ##########                    reg3
     ###############---------------#####                              reg4
#####                                                                 reg5
          ##########---------------#####                              reg6

reg1.intersects(reg2)             // true
reg1.intersects(0,10)             // false
reg1.contains(20)                 // false
reg1.isIntronic(20)               // true
reg1.contains(reg2)               // false
reg1.contains(reg3)               // true
reg1.isIntronConsistent(reg2)     // false (if no exonic position of one is in an intron of the other)
reg1.isIntronConsistent(reg3)     // true
reg1.isIntronConsistent(reg4)     // true
reg1.isIntronConsistent(reg5)     // true
reg1.containsUnspliced(reg2)      // false (if intron consistent and contained!)
reg1.containsUnspliced(reg4)      // false
reg1.containsUnspliced(reg6)      // true

Inducing and mapping

Each genomic regions has an induced coordinate system, which is from 0 to, not including, its total length. The induce methods allow to map a position in the same coordinate system as the region to the induced coordinate system, the map methods do the inverse. These are extremely powerful operations and usable in many situations!

|0        |10       |20       |30       |40       |50       |60       
          ##########---------------#########################          reg1
              ######---------------#####                              inner

GenomicRegion ind = reg1.induce(inner)    // 4-15
reg1.induce(14)                           // 4
reg1.induce(4)                            // Exception
reg1.induceMaybeOutside(14)               // 4
reg1.induceMaybeOutside(4)                // -6

reg1.map(4)                               // 14
reg1.mapMaybeOutside(-6)                  // 4
reg1.map(ind)                             // 14-20|35-40 (=inner)

Extracting things contained in genomic regions

s=
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
|0        |10       |20       |30       |40       |50       |60       
          ##########---------------#########################          reg1

reg1.extractSequence(s)                          // KLMNOPQRSTjklmnopqrstuvwxyz01234567
reg1.extractArray(EI.seq(0,60).toIntArray())     // an int array containing the elements 10 through 19 and 35 through 59 (there are overloaded methods to extract from double or generic arrays

ReferenceGenomicRegion

There are two implementations for genomic data objects:

MutableReferenceGenomicRegion: Has setters for its ReferenceSequence, GenomicRegion and data object; in addition, there are transformXXX methods for each of them, receiving a function (useful for fluent programming style); has a parse method to set reference and region (from a string like 1-:234325542-234326542)
ImmutableReferenceGenomicRegion: Has no setters or transform methods; has a static parse method

Never store MutableReferenceGenomicRegion except you know exactly what you are doing (as you can never be sure that its contents do not change).

One of the most useful things is that ReferenceGenomicRegion also provide map and induce methods, allowing to do these mappings in 5' to 3' direction also for minus strand locations.

Transcript

An important data object class is Transcript (e.g. used by annotations derived from Genomic). It has two string properties to store the gene and transcript ids, and the position of potential start and stop codons. Most useful are the methods get5Utr, getCds and get3Utr, that allow to extract these three parts (as GenomicRegion object) from a ReferenceGenomicRegion in safe way from both strands.

GenomicRegionStorage

The interface GenomicRegionStorage essentially is a collection of ReferenceGenomicRegion objects and, among other things, provides methods to retrieve all ReferenceGenomicRegion objects intersecting a given genomic location. The most important implementations are MemoryIntervalTreeStorage (everything is kept in interval trees in main memory), CenteredDiskIntervalTreeStorage (everything stays on hard disk organized in an interval tree optimized for hard disks) and BamGenomicRegionStorage (treat one or several bam files just like other GenomicRegionStorage objects).

The most important methods are overloaded methods named ei, to obtain an ExtendedIterator of all ReferenceGenomicRegion objects, of all from a given ReferenceSequence or of all intersecting a given ReferenceGenomicRegion.

In addition, there are several classes to read popular file formats of genomic data, such as BedFileReader, GtFileReader, GffFileReader, FimoFileReader, NarrowPeakFileReader, GenbankFileReader, that are based on generic readers for text files where intervals are either spread across multiple lines (as in gtf files) or all in one line (as in bed), making it straight-forward to implement readers for new file formats.

GenomicData - erhard-lab/gedi GitHub Wiki

Table of Contents

Genomic data

Pitfalls

Using the wrong genome assembly version

Getting confused with the coordinate system

The difficulties of regions on the - strand

GenomicRegion

Creating

Retrieving information from a genomic region

Retrieving intervals

Manipulating a genomic region

Computing with two genomic regions

Testing positional properties of genomic regions

Inducing and mapping

Extracting things contained in genomic regions

ReferenceGenomicRegion

Transcript

GenomicRegionStorage

⚠️ GitHub.com Fallback ⚠️

GenomicData - erhard-lab/gedi GitHub Wiki

Table of Contents

Genomic data

Pitfalls

Using the wrong genome assembly version

Getting confused with the coordinate system

The difficulties of regions on the - strand

GenomicRegion

Creating

Retrieving information from a genomic region

Retrieving intervals

Manipulating a genomic region

Computing with two genomic regions

Testing positional properties of genomic regions

Inducing and mapping

Extracting things contained in genomic regions

ReferenceGenomicRegion

Transcript

GenomicRegionStorage

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️