RNA Seq for Neurogenetics - labbcb/mapping GitHub Wiki

Locate directory where all files are located

setwd('~benilton/Dropbox/Curso4h')
pkgs <- c('Rsamtools', 'Rsubread', 'DESeq2', 'genefilter', 'gplots', 'doMC')
source('http://www.bioconductor.org/biocLite.R')
biocLite(pkgs)

Load packages of interest

library(Rsamtools)
library(Rsubread)

Pointing to a reference and building an index

reference <- 'Rattus_norvegicus.Rnor_5.0.75.dna.chromosome.1.fa'
buildindex('rn5index', reference)

Getting the FASTQ files

r1 <- list.files('fastq', pattern='r1.fastq', full=TRUE)
r2 <- list.files('fastq', pattern='r2.fastq', full=TRUE)

Quality Control

library(Rqc)
library(doMC)
registerDoMC(2)
qcr <- qc(c(r1, r2))

We save the report to the directory qc

url <- qcreport(qcr)
browseURL(url)

To simplify our work, we use the length of the vector r1 as a proxy to count the number of samples.

n <- length(r1)

Defining the location and names of the output files.

## Define filename
out <- basename(gsub('r1.fastq', 'bam', r1))
out
## Where to save
outdir <- 'bam'
dir.create(outdir)
out <- file.path(outdir, out)
out

Now, define the basename (without extension) of the sorted BAM files:

## The sorted files below will get the .bam extension
sorted <- gsub('bam$', 'sorted', out)

Here, we align using the Rsubread package (subread algorithm). At the end, each BAM file is sorted.

for (i in 1:n){
    align('rn5index', readfile1=r1[i], readfile2=r2[i],
          input_format='FASTQ', nthreads=2,
          output_format='BAM', output_file=out[i])
    sortBam(out[i], sorted[i])
}

We get the full names of the sorted files below. They are then indexed to ensure speed for access.

## Get the full names of the sorted files
sorted <- list.files('bam', pattern='sorted', full=TRUE)
indexBam(sorted)

Use the gene annotation given by the GTF file to count hits to exons and combine them into gene meta-features.

## Count how many hits we get
fc <- featureCounts(sorted,isGTFAnnotationFile=TRUE,
                        annot.ext='Rattus_norvegicus.Rnor_5.0.75.gtf',
                        nthreads=2, isPairedEnd=TRUE)

To visualize the structure of the count table:

## Remove all the 0-counts
zeroes <- rowSums(fc$counts == 0) == 6
counts <- fc$counts[!zeroes,]
head(counts[, c(3, 6)])

Let's observe LRP11:

## Lets see one: Lrp11
subset(fc$annotation, GeneID=='ENSRNOG00000014303')

The Analysis of Differential Expression Using Our Data

Following the aforementioned strategy, we create the count table, which is now loaded.

actual <- read.table('counts/matRN5.txt', sep='\t', header=TRUE, row.names=1)

We will use the DESeq2 and genefilter.

## Loading packages
library(DESeq2)
library(genefilter)

Let's observe what happens to the distribution of variances:

## Get the variances
vars <- log(rowVars(actual))
plot(density(vars), xlab='log-variance', ylab='density')
## NSF
i <- vars < 5

Create the table that contains phenotypic information.

## Phenotype information
pd <- data.frame(group=rep(c('control', 'pilo'), each=3),
                 lane=rep(c('lane1', 'lane2', 'lane3'), 2))
pd

We use the DESeqDataSetFromMatrix to load both our data and phenotypic information into a DESeq2-compatible object.

## Create object that will be used by DESeq
rawData <- DESeqDataSetFromMatrix(actual, pd, ~ group + lane)
dds <- DESeq(rawData)
resultsNames(dds)

The MA-plot is shown below.

## MAplot
plotMA(dds, 'group_pilo_vs_control')

And, now, genes that are candidate for differential expression.

## Genes candidatos a DE
res <- results(dds, 'group_pilo_vs_control')
## Converte para um formato mais amigavel
res <- as.data.frame(res)
## Selectiona apenas aqueles com FDR < 0.05
res <- subset(res, padj < 0.05)
## Ordenar por p-valor
res <- res[order(res$padj),]
head(res)

Some extra visualizations

rld <- rlogTransformation(dds, blind=TRUE)
plotPCA(rld, intgroup='group')
plotPCA(rld, intgroup='lane')