Genomic - erhard-lab/gedi GitHub Wiki

Table of Contents

Genomic

In Gedi, there is a central class that handeles everyhing that has to do with a specific genome. This includes

  • The genome sequence (random access to it, i.e. give me the sequence on chromosome X from positions A to B)
  • The annotation (all transcripts stored in a GenomicRegionStorage)
  • Mapping tables (e.g. from gene id to gene symbol)
  • Access to things like the bowtie or STAR index

Everything is evaluated in a lazy fashion (e.g. tables are only loaded when needed), and by caching everything (anything that has been loaded once, must not be loaded again; this makes sense as only indices are loaded into main memory).

Creating a genome index

New genome indices accessible via the class Genomic are created using the command line tool IndexGenome. There are three ways to automatically generate references:

  1. Using a fasta file and a gtf file (as e.g. obtained from Ensembl)

    gedi -e IndexGenome -s Homo_sapiens.GRCh38.dna.primary_assembly.fa -a Homo_sapiens.GRCh38.86.gtf -ignoreMulti -n homo_ens86

    The -ignoreMulti is often necessary when the gtf file contains two or more identical transcripts with different names. The -n parameter specifies a short name to access this reference (otherwise it would be called Homo_sapiens.GRCh38.86).

  2. Using a genbank file

    gedi -e IndexGenome -gb NC_006273.gb -gblabel gene

    The -gblabel argument determines the name of the genbank feature that is used as the gene name (as there is no standard).

  3. Using only a fasta file.

    gedi -e IndexGenome -s rRNA.fa

    This is particularly useful for a compiled rRNA database. All indexing steps refering to an annotation are skipped.

By default, an oml file referencing all generated files is put into $HOME/.gedi/genomic. All files within this folder can be accessed by name (instead of a full path) whenever a prepared reference is needed.

The complete usage of this tool is:

gedi -e IndexGenome <Options>

Options:
 -s <fasta-file>        Fasta file containing chromosomes
 -a <gtf-file>          Gtf file containing annotation
 -ignoreMulti           Ignore identical transcripts in Gtf file
 -gb <genbank-file>     Genbank file containing annotation and sequence
 -gblabel <label>       Which genbank entry to take as gene and transcript label (default: label)
 -f <folder>            Output folder (Default: next to Fasta and Gtf / genbank)
 -n <name>              Name of the genome for later use (Default: file name of gtf/genbank-file)
 -o <file>              Specify output file (Default: ~/.gedi/genomic/${name}.oml)
 -nobowtie              Do not create bowtie indices
 -nostar                Do not create STAR indices
 -nokallisto            Do not create kallisto indices
 -p                     Show progress
 -h                     Show this message
 -D                     Output debugging information

Background

Eukaryotic genomes (and also genomes of double stranded DNA viruses such as herpesviruses) consist of one or several chromosomes. A chromosome has two strands, often referred to as Watson and Crick strands, or Plus (+) and Minus (-) strand. Each strand of a chromosome has a sequence (of its nucleobases in 5' to 3' direction. Due to base pairing and the antiparallelism of the two strands, the - sequence is the reverse complement of the + strand. In Gedi, the tuple composed of a chromosome and a strand (+,- or unspecific) is called a ReferenceSequence (e.g. 1+ or 12-).

On the chromosomes, there are genetic elements. They either belong to a specific strand (such as genes, transcripts), or are strand unspecific (e.g. transcription factor binding sites). Thus, a genetic element belongs to a specific ReferenceSequence. By convention, strand specific elements such as transcripts are assigned to their sense strand, i.e. the strand that has the same sequence as the corresponding mRNA (and not the strand from which they are transcribed).

A genetic element is a collection of intervals on its ReferenceSequence. It may just be a collection of a single interval (e.g. transcription factor binding site) or several intervals (e.g. the exons of a transcript). In Gedi, this collection is called a GenomicRegion, and the intervals are specified using 0 based coordinates, with start inclusive and end exclusive (e.g. 10-20 or 120-140|180-190).

Basic usage

A Genomic object can be obtained using

Genomic g = Genomic.get("<id>");

Here, <id> is a name given via -n to IndexGenome. It is also possible to obtain a combined genome:

Genomic g = Genomic.get("h.ens86","JN555585");

This is for instance most useful to have a virus and a host genome accessible via a single object.

Getting transcripts

Obtaining a GenomicRegionStorage containing all transcripts is easy:

GenomicRegionStorage<Transcript> storage = g.getTranscripts();

As you might know, a GenomicRegionStorage contains GenomicRegion objects for several ReferenceSequences (usually chromosomes) along with a specific data object (Transcripts here).

All transcripts

To obtain all transcripts from storage, just do:

ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei = storage.ei();

ImmutableReferenceGenomicRegion objects bundle the reference sequence, the genomic region and the data object of a storage, and the ei() method returns an ExtendedIterator over all transcripts in the storage, orderd according to reference and region.

All transcripts intersecting a specified region on the genome

It is also possible to obtain an iterator just for one strand of a specific reference sequence, or only over transcripts that intersect a specified region:

// all for a given reference sequence ref
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei = storage.ei(ref);

// the same, but given by name
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei = storage.ei("1+");

// all intersecting a specific region on a given reference sequence ref
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei = storage.ei(ref,region);

// the same, but given by name
ExtendedIterator<ImmutableReferenceGenomicRegion<Transcript>> ei = storage.ei("1+:1000000-2000000");

Transcripts by identifier

ReferenceGenomicRegions can also be obtained for a given transcript id:

g.getTranscriptMapping().apply("ENST00000456328")

Obtaining a sequence

The basic way of obtaining a sequence from the genome is

g.getSequence("1+:1000000-2000000");

The methods return sequences in 5' to 3' direction. Thus for

g.getSequence("1-:1000000-2000000");

first, the sequence from the + strand is obtained, reversed and complemented. There are also methods to access the sequence from ReferenceSequence and GenomicRegion objects, or from a ReferenceGenomicRegion object.

Mapping

Genetic elements have additional information associated. These can be accessed using

g.getTranscriptTable("biotype").apply("ENST00000456328")
⚠️ **GitHub.com Fallback** ⚠️