IV. Management of Genome Scale Data with Bioconductor - abishpius/R-for-Computational-Biology GitHub Wiki

Organizing Microarray Data with ExpressionSet

The ExpressionSet Class

library(Biobase)
library(GEOquery)

geoq <- getGEO("GSE9514")    # download a microarray dataset from GEO
names(geoq)    
e <- geoq[1](/abishpius/R-for-Computational-Biology/wiki/1)    # extract ExpressionSet
e

# exprs gives matrix of microarray values
dim(e)    # number of features and samples in ExpressionSet
ncol(e)
nrow(e)

exprs(e)[1:3,1:3]
head(exprs(e))[,1]    # first column
exprs(e)[1,]    # first row
exprs(e)["10000_at",]    # can also index by name
rownames(e)[1]    # row names are probe sets
dim(exprs(e))    # rows are features, columns are samples

# pData gives phenotype data (sample information)
pData(e)[1:3,1:6]
names(pData(e))
pData(e)$characteristics_ch1    # column in GEO to describe experimental state/condition
as.numeric(pData(e)$characteristics_ch1)    # help see replicates of each state
dim(pData(e))    # rows of pData correspond to columns of exprs
dim(e)

# fData gives feature data (probe information)
fData(e)[1:3,1:3]
dim(fData(e))    # rows of fData correspond to rows of exprs
names(fData(e))
head(fData(e)$"Gene Symbol")
head(rownames(e))

# additional annotation tied to ExpressionSet
experimentData(e)
annotation(e)

Reading Microarray Raw Data: Single-Color Arrays

# code edited to set your personal working directory
wd <- getwd()
datadir <- paste0(wd, "/rawdata-master")    # downloaded files
basedir <- paste0(datadir, "/celfiles")
setwd(basedir)
library(affy)
tab <- read.delim("sampleinfo.txt",check.names=FALSE,as.is=TRUE)
rownames(tab) <- tab$filenames
tab
fns <- list.celfiles(basedir)
fns
fns %in% tab[,1] ##check
ab <- ReadAffy(phenoData=tab)

dim(pm(ab))
dim(pData(ab))
rownames(ab)
colnames(pm(ab))
annotation(ab)

e <- rma(ab)    # preprocess probe-level data into gene-level data

ejust <- justRMA(filenames=tab[,1],phenoData=tab)    # read and process data to gene-level in one command
dim(ejust)

Reading Microarray Raw Data: Agilent Two-Color Array

# datadir defined in previous video
library(limma)
library(rafalib)
basedir - paste0(datadir, "/agilent")
setwd(basedir)
targets <- readTargets("TargetBeta7.txt")
RG <- read.maimages(targets$FileName, source="genepix")

MA <- MA.RG(RG,bc.method="none")
dim(RG$R)
dim(RG$G)
dim(MA$M)
dim(MA$A)
plot(MA$A[,1], MA$M[,1])    # MA plot for first sample

# microarray image
mypar(1,1)
imageplot(MA$M[,2], RG$printer, zlim=c(-3,3))
dev.off()

Organizing NGS Data with SummarizedExperiment

The SummarizedExperiment Class

library(parathyroidSE)
data(parathyroidGenesSE)
se <- parathyroidGenesSE
se

# assay contains results of the assay
dim(se)
assay(se)[1:3,1:3]
dim(assay(se))    # rows = features (ranges), columns = samples

# colData contains sample information
colData(se)[1:3,1:6]
dim(colData(se))
names(colData(se))
colData(se)$treatment
as.numeric(colData(se)$treatment)

# rowRanges contains feature information
rowRanges(se)[1]
class(rowRanges(se))
length(rowRanges(se))    # number of genes
length(rowRanges(se)[1](/abishpius/R-for-Computational-Biology/wiki/1))    # number of exons for first gene
head(rownames(se))
metadata(rowRanges(se))

# additional metadata, including sample information
metadata(se)$MIAME
abstract(metadata(se)$MIAME)

Importing NGS Data in R

There are two main Bioconductor packages for working with next generation sequencing (NGS) data in R: the Rsamtools package and the GenomicAlignments package. You can think of the difference as:

Rsamtools provides low-level functions for reading and parsing NGS data stored in standard formats (more details below)
GenomicAlignments provides high-level functions and classes for reading and organizing NGS data as Bioconductor objects based on the GRanges class Note that both packages work with aligned NGS data in BAM format, rather than raw NGS data in the FASTQ format.

The Rsamtools Package

The Rsamtools package description explains that this package: These utilities are some of the most commonly used tools for analyzing NGS data. However, they are implemented in C and most commonly used as command line tools. The Rsamtools package enables users to call on the functions in these utilities directly from R.

What are BAM files

You might not be familiar with all of the formats listed above, but the ones we are interested in for now are SAM and it's compressed form BAM. We will refer here to BAM files, because these are the kind of files which are kept most often because they are smaller (there is a SAMtools utility for converting SAM to BAM and BAM to SAM).

SAM and BAM files contain information about the alignment of NGS reads to a reference genome. These files are produced by alignment software, which take as input:

The FASTQ files from the sequencing machine (either 1 file for a single-end sequencing sample, or 2 files for a paired-end sequencing sample).
A genomic index, which is typically produced by special software packaged with the alignment software. The genomic index is created from the reference genome. Sometimes the genomic index files for popular reference genomes can be downloaded.

Note: alignment software is typically application specific. In particular, alignment programs for RNA-seq are different than those for genomic DNA sequencing, because in the case of the former, it is expected that reads might fall on exon-exon boundaries. The read will not contain intron sequence, because it is typically the mature, spliced mRNA which is converted to cDNA and sequenced, and the introns are already spliced out of this molecule. This is not a concern for genomic DNA sequencing.

How to import NGS data using Rsamtools

We will use example BAM files from the pasillaBamSubset package to examine the Rsamtools functions:

library(pasillaBamSubset)
library(Rsamtools)
filename <- untreated1_chr4()

We can create a BamFile object using the function BamFile(), which allows other functions to know how to process the file.

bf <- BamFile(filename)

We can ask about information on the chromosomes which are declared in the header of the BAM file:

seqinfo(bf)
sl <- seqlengths(bf)

A summary of the kind of alignments in the file can be generated:

quickBamFlagSummary(bf)

Specifying: what and which

A number of functions in Rsamtools take an argument param, which expects a ScanBamParam specification. There are full details available by looking up ?scanBamParam, but two important options are:

what - what kind of information to extract?
which - which ranges of alignments to extract?

BAM files are often paired with an index file (if not they can be indexed from R with indexBam), and so we can quickly pull out information about reads from a particular genomic range. Here we count the number of records (reads) on chromosome 4:

gr <- GRanges("chr4", IRanges(1, sl["chr4"]))
countBam(bf, param = ScanBamParam(which = gr))

We can pull out all the information with scanBam. Here, we specify a new BamFile, and use the yieldSize argument. This limits the number of reads which will be extracted to 5 at a time.

reads <- scanBam(BamFile(filename, yieldSize = 5))

Each time we call scanBam we will get 5 more reads, until there are no reads left. If we do not specify yieldSize we get all the reads at once. yieldSize is useful mainly for two reasons: (1) for limiting the number of reads at a time, for example, 1 or 2 million reads at a time, to keep within the memory limits of a given machine, say in the 5 GB range (2) or, for debugging, working through small examples while writing software.

Examining the output of scanBam

The reads object above is a list of lists. The outer list indexes over the ranges in the which command. Since we didn't specify which, here it is a list of length 1. The inner list contains different pieces of information from the BAM file. Since we didn't specify what, we get everything. See ?scanBam for the possible kinds of information to specify for what.

class(reads)
names(reads[1](/abishpius/R-for-Computational-Biology/wiki/1))
reads[1](/abishpius/R-for-Computational-Biology/wiki/1)$pos       # the aligned start position
reads[1](/abishpius/R-for-Computational-Biology/wiki/1)$rname     # the chromosome
reads[1](/abishpius/R-for-Computational-Biology/wiki/1)$strand    # the strand
reads[1](/abishpius/R-for-Computational-Biology/wiki/1)$qwidth    # the width of the read
reads[1](/abishpius/R-for-Computational-Biology/wiki/1)$seq       # the sequence of the read

Here we give an example of specifying what and which to specify what parameters to extract over which ranges:

gr <- GRanges("chr4", IRanges(500000, 700000))
reads <- scanBam(bf, param = ScanBamParam(what = c("pos","strand"), which = gr))

The GenomicAlignments package

The GenomicAlignments package description explains that this package:

"Provides efficient containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). This includes read counting, computing the coverage, junction detection, and working with the nucleotide content of the alignments."

This package defines the classes and functions which are used to represent genomic alignments in Bioconductor. Two of the most important functions in GenomicAlignments are:

readGAlignments() - this and other similarly named functions read data from BAM files
summarizeOverlaps() - this function simplifies counting reads in genomic ranges across one or more files The summarizeOverlaps() function is covered in more depth when we discuss read counting. summarizeOverlaps() is a function which wraps up other functions in GenomicAlignments for counting reads.

Here we will examine the output of the readGAlignments() function, continuing with the BAM file from the pasilla dataset:

library(GenomicAlignments)
ga <- readGAlignments(bf)
length(ga)

Note that we can extract the GRanges object within the GAlignments object, although we will see later that we can often work directly with the GAlignments object:

granges(ga[1])

Some of our familiar GenomicRanges functions work on GAlignments: we can use findOverlaps(), countOverlaps() and %over% directly on the GAlignments object. Note that location of ga and gr in the calls below:

gr <- GRanges("chr4", IRanges(700000, 800000))
(fo <- findOverlaps(ga, gr))    # which reads over this range
countOverlaps(gr, ga)    # count overlaps of range with the reads
table(ga %over% gr)    # logical vector of read overlaps with the range

If we had run countOverlaps(ga, gr) it would return an integer vector with the number of overlaps for each read with the range in gr.

Creating a Count Table from a BAM File

# libraries with BAM files and annotation
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

grl <- exonsBy(txdb, by="gene")    # make GRangesList of exons for each gene
grl[100]    # GRangesList of exons for 100th gene
grl[100](/abishpius/R-for-Computational-Biology/wiki/100)    # GRanges with exons of 100th gene
grl[100](/abishpius/R-for-Computational-Biology/wiki/100)[1]    # first exon of 100th gene

# paths to BAM files
fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()

# libraries for importing BAM files
library(Rsamtools)
library(GenomicRanges)
library(GenomicAlignments)

# specify files with BamFileList
fls <- BamFileList(c(fl1, fl2))
names(fls) <- c("first","second")

# find reads that overlap exons
so1 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE)
so1

# examine count matrix
head(assay(so1))
colSums(assay(so1))

# examine rest of SummarizedExperiment components
rowRanges(so1)
colData(so1)
colData(so1)$sample <- c("one","two")    # add sample information
colData(so1)
metadata(rowRanges(so1)) 

# exploratory data analysis of counts
x <- assay(so1)[,1]
hist(x[x > 0], col="grey")
hist(x[x > 0 & x < 10000], col="grey")
plot(assay(so1) + 1, log="xy")

# count second file as paired-end reads
# ?untreated3_chr4
# ?summarizeOverlaps
fls <- BamFileList(fl2)
so2 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE,
                         singleEnd=FALSE, 
                         fragments=TRUE)
colSums(assay(so2))
colSums(assay(so1))

# show there are half as many reads in so2 as so1
plot(assay(so1)[,2], assay(so2)[,1], xlim=c(0,5000), ylim=c(0,5000),
     xlab="single end counting", ylab="paired end counting")
abline(0,1)
abline(0,.5)

Public Data Repositories

GEOquery: Accessing 1000s of microarray experiments at NCBI

library(GEOquery)
glioMA = getGEO("GSE78703")[1](/abishpius/R-for-Computational-Biology/wiki/1)
glioMA

names(pData(glioMA))    # variable names
glioMA$molecule_ch1    # molecule being assayed (RNA)
table(glioMA$`treated with:ch1`, glioMA$`cell type:ch1`)    # experimental variables

Array Oriented Notation

library(hgu133plus2.db)
hgu133plus2.db

library(hgu133plus2probe)
head(hgu133plus2probe)
dim(hgu133plus2probe)


select(hgu133plus2.db, keytype="PROBEID", 
       columns=c("SYMBOL", "GENENAME", "PATH", "MAP"), keys="1007_s_at")