V. Genomic Annotation with Bioconductor - abishpius/R-for-Computational-Biology GitHub Wiki

Annotation of genes and transcripts

Representing reference sequence

Bioconductor includes many different types of genomic annotation. We can think of these annotation resources in a hierarchical structure.

At the base is the reference genomic sequence for an organism. This is always arranged into chromosomes, specified by linear sequences of nucleotides.
Above this is the organization of chromosomal sequence into regions of interest. The most prominent regions of interest are genes, but other structures like SNPs or CpG sites are annotated as well. Genes have internal structure, with parts that are transcribed and parts that are not, and “gene models” define the ways in which these structures are labeled and laid out in genomic coordinates.
- Within this concept of regions of interest we also identify platform-oriented annotation. This type of annotation is typically provided first by the manufacturer of an assay, but then refined as research identifies ambiguities or updates to initially declared roles for assay probe elements. The brainarray project at University of Michigan illustrates this process for affymetrix array annotation. We address this topic of platform-oriented annotation at the very end of this chapter.
Above this is the organization of regions (most often genes or gene products) into groups with shared structural or functional properties. Examples include pathways, groups of genes found together in cells, or identified as cooperating in biological processes.

Code: Discovering available reference genomes

library(BSgenome)
library(Biostrings)
ag = available.genomes()
grep("Scerev", ag, value=TRUE) #find yeast
grep("Hsap", ag, value=TRUE)  #find human

# inspect the human genome
library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
length(Hsapiens)
class(Hsapiens)
methods(class="BSgenome") #find applicable functions

# inspect human genome
Hsapiens$chrX
substr(Hsapiens$chrX, 5e6, 5.1e6)
nchar(HSapiens$chrY)
nchar(Hsapiens[24](/abishpius/R-for-Computational-Biology/wiki/24))

Packages for gene and transcript catalogs

# import TxDb transcript database
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene  # abbreviate
class(txdb)
methods(class="TxDb")

# extract and inspect genes from TxDb
genes(txdb)
table(strand(genes(txdb)))
summary(width(genes(txdb)))

# inspect largest gene in genome
id = which.max(width(genes(txdb)))
genes(txdb)[id]
library(org.Hs.eg.db)
select(org.Hs.eg.db, keys="286297", keytype = "ENTREZID", columns = c("SYMBOL", "GENENAME"))

# compare total size of exons to total size of genes
ex = exons(txdb)
rex = reduce(ex)
ex_width = sum(width(rex))    # bases in exons
gene_width = sum(width(genes(txdb)))    # bases in genes
ex_width/gene_width

ensembldb, EnsDb: Annotation from EMBL

The ensembldb package provides functions to create and use transcript centric annotation databases/packages. The annotation for the databases are directly fetched from Ensembl 1 using their Perl API. The functionality and data is similar to that of the TxDb packages from the GenomicFeatures package, but, in addition to retrieve all gene/transcript models and annotations from the database, the ensembldb package provides also a filter framework allowing to retrieve annotations for specific entries like genes encoded on a chromosome region or transcript models of lincRNA genes. From version 1.7 on, EnsDb databases created by the ensembldb package contain also protein annotation data. For more information on the use of the protein annotations refer to the proteins vignette.

Code: Using ensembldb

# inspect data available from Ensembl
library(ensembldb)
library(EnsDb.Hsapiens.v75)
names(listTables(EnsDb.Hsapiens.v75))

# extract Ensembl transcripts
edb = EnsDb.Hsapiens.v75  # abbreviate
txs <- transcripts(edb, filter = GeneNameFilter("ZBTB16"),
                   columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs

# compare Ensembl and UCSC transcripts
alltx = transcripts(edb)    # Ensembl is larger
utx = transcripts(txdb)    # UCSC is smaller

# table of biological types of transcripts
table(alltx$tx_biotype)

AnnotationHub, liftOver, and rtracklayer

AnnotationHub: finding and caching important annotation

AnnotationHub is a central hub for genomic annotation files maintained by the Bioconductor community. The resource includes annotation files from resources like UCSC, ENSEMBL, and the Broad Institute, and can be accessed using the AnnotationHub package. AnnotationHub allows you to search and download these resources from inside your R session.

library(AnnotationHub)
ah <- AnnotationHub()
ah

length(unique(ah$species))

ah_human <- subset(ah, species == "Homo sapiens")
ah_human

query(ah, "HepG2")
query(ah, c("HepG2", "H3K4me3"))
hepg2 <- query(ah, "HepG2")
hepg2_h3k4me3 <- query(hepg2, c("H3K4me3"))
hepg2_h3k4me3
hepg2_h3k4me3$tags

display(query(ah, "HepG2"))

e118_broadpeak <- query(hepg2_h3k4me3, c("E118", "broadPeak"))
id <- e118_broadpeak$ah_id
id

hepg2_h3k4me3_broad <- ah["AH29728"](/abishpius/R-for-Computational-Biology/wiki/"AH29728")
hepg2_h3k4me3_broad
alt_format <- ah[id](/abishpius/R-for-Computational-Biology/wiki/id)
identical(hepg2_h3k4me3_broad, alt_format)

liftOver: Translating between reference builds

Genomic annotations are typically defined for a fixed genome build. For human annotations, this is often hg19. When an analysis is performed using a different genome build, annotations must first be translated to the coordinates of the new build before they can be used. This process of translating between builds is called lifting. In Bioconductor, lifting is implemented in the liftOver() function of the rtracklayer package. This tutorial will walk you through using liftOver() to lift features from one genome build to another. In this example, we will move features from genome build hg38 to hg19.

# learn about liftOver from rtracklayer
library(rtracklayer)
?liftOver

# chromosome 1 gene locations in hg38
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx38 <- TxDb.Hsapiens.UCSC.hg38.knownGene
seqlevels(tx38, pruning.mode="coarse") = "chr1"
g1_38 <- genes(tx38)

# download the hg38 to hg19 chain file
library(AnnotationHub)
ah <- AnnotationHub()
ah.chain <- subset(ah, rdataclass == "ChainFile" & species == "Homo sapiens")
query(ah.chain, c("hg19", "hg38"))
ch <- ah["AH14108"](/abishpius/R-for-Computational-Biology/wiki/"AH14108")

# perform the liftOver
g1_19L <- liftOver(g1_38, ch)
g1_19L

Data import and export with rtracklayer

The rtracklayer package parses data into common formats so they can be easily used as annotation in future analysis.

library(devtools)
install_github("genomicsclass/ERBS")    # install ERBS package

f1 = dir(system.file("extdata",package="ERBS"), full=TRUE)[1]    # access data
readLines(f1, 4)    # look at a few lines

library(rtracklayer)
imp = import(f1, format="bedGraph")    # import as bedGraph format
imp

genome(imp)    # genome identifier tag not set, but you should set it
genome(imp) = "hg19"    # set genome
genome(imp)

export(imp, "demoex.bed")    # export as BED format  
cat(readLines("demoex.bed", n=5), sep="\n")    # check output file

Annotation Tools for Systems Biology

OrgDb: unified organism-specific annotation for systems biology

Packages named org.*.eg.db collect information at the gene level with links to location, protein product identifiers, KEGG pathway and GO terms, PMIDs of papers mentioning genes, and to identifiers for other annotation resources.

# load human OrgDb and inspect available keys
library(org.Hs.eg.db)
org.Hs.eg.db
keytypes(org.Hs.eg.db) # columns() gives same answer

# load GO.db and inspect available terms
library(GO.db)
allterms = keys(GO.db, keytype="TERM")
allterms[1:5]

# find GOID (gene ontology tag) for ribosome biogenesis
select(GO.db,  keys = "ribosome biogenesis", keytype="TERM", columns="GOID")

# find symbols for genes involved in ribosome biogenesis 
select(org.Hs.eg.db, keys = "GO:0042254", keytype="GO", columns="SYMBOL")

# you can pull out multiple columns at once
select(org.Hs.eg.db, keys = "GO:0042254", keytype="GO", columns=c("SYMBOL", "ENTREZID"))

# find gene ontology tags for related to ZNF658, which has the specified ENTREZID
select(org.Hs.eg.db, keys= "26149", keytype="ENTREZID", columns="GO")

# save GO tags to a character vector
select(org.Hs.eg.db, keys= "26149", keytype="ENTREZID", columns="GO")$"GO"
myk = .Last.value

# identify biological processes ZNF658 is involved in
select(GO.db, keys=myk, columns="TERM")

Using Kyoto Encyclopedia of Genes and Genomes (KEGG)

# load KEGGREST package and inspect organism-specific gene pathways
library(KEGGREST)
brca2K = keggGet("hsa:675")    # reference to a specific gene
names(brca2K[1](/abishpius/R-for-Computational-Biology/wiki/1))
brpat = keggGet("path:hsa05212")    # info on a pathway
brpat[1](/abishpius/R-for-Computational-Biology/wiki/1)$GENE[seq(1,132,2)] # entrez gene ids for pathway

# inspect some entrez ids
select(org.Hs.eg.db, keys="5888", keytype="ENTREZID", columns ="SYMBOL")
select(org.Hs.eg.db, keys="675", keytype="ENTREZID", columns ="SYMBOL")

# diagram showing structure of network
library(png)
library(grid)
brpng = keggGet("hsa05212", "image")
grid.raster(brpng)

Using EMBL Ontology Lookup Service with rols

library(rols)
oo = Ontologies()
oo
oo[1](/abishpius/R-for-Computational-Biology/wiki/1)

glis = OlsSearch("glioblastoma")
glis
res = olsSearch(glis)
resdf = as(res, "data.frame") # get content
resdf[1:4,1:4]
resdf[1,5]  # full description for one instance