ex402 - nibb-gitc/gitc2021sep-rnaseq GitHub Wiki

ex402: Differential expression analysis with edgeR

arab2データの遺伝子発現の2群間比較を、edgeRで行う。

arab2.txt は、ex301, ex302でもあつかった、シロイヌナズナRNA-seqのデータである。細菌の感染と非感染でトランスクリプトームを比較している。実験デザインに関する記述を原著論文から引用する。

Bacteria were grown in King’s B media and infiltrated into plants as previously described. Briefly, we used a syringe lacking a needle to infiltrate the abaxial side of leaves of six-week old Arabidopsis plants. Plants were infected with either the ∆hrcC mutant of Pseudomonas syringae pv. tomato DC3000 (PtoDC3000) or mock inoculated with 10 mM MgCl2 7 hpi. Each treatment was done as biological triplicates with each pair of replicates done at separate times and derived from independently grown plants and bacteria. -- Cumbie et al., 2011 PLOS ONE

つまり、6 libraries (2 groups x 3 biological replicates) のデータ。すでにマッピング済みで遺伝子毎のリードカウントがタブ区切りテキストとして提供されている。m1, m2, m3 が mock (control), h1, h2, h3 が　Pseudomonas感染。

Import data

> dat <- read.delim("arab2.txt", row.names=1)

ここまでは、ex301, ex302と同じ。

Import edgeR library

> library(edgeR)

2グループ、各3繰り返し実験、という実験デザインを定義する。（m1, m2, m3 が mock (control), h1, h2, h3 が　Pseudomonas感染。）

> grp <- c("M", "M", "M", "H", "H", "H")
> grp
[1] "M" "M" "M" "H" "H" "H"

edgeRのDGEList関数でカウントデータを読み込む。

> D <- DGEList(dat, group=grp)

Normalization

TMM法で、ノーマライズする。calcNormFactorsを使う。

> D <-calcNormFactors(D, method="TMM")

計算結果の確認

> D$samples
   group lib.size norm.factors
m1     M  1902032    1.0399197
m2     M  1934029    1.0611305
m3     M  3259705    0.8841923
h1     H  2129854    1.0266944
h2     H  1295304    1.1412144
h3     H  3526579    0.8747345

(Option) Normalize効果の可視化

> par(mfrow=c(1,2))
> boxplot(log10(dat + 1))    # before normalization
> boxplot(log10(cpm(D) + 1)) # after normalization

注）par(mfrow=c(1,2)) の効果を無効化するには、 dev.off()

DE testing

estimate dispersion

> D <- estimateCommonDisp(D)
> D$common.dispersion
[1] 0.3425927

> D <- estimateTagwiseDisp(D)
> summary(D$tagwise.dispersion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0998  0.1661  0.4531  1.3929  2.3452  5.6635

(注：edgeRのバージョンで値は少し異なる。上記は、edgeR ver. 3.34.1 での結果)

DE test

> de.tagwise <- exactTest(D, pair=c("M", "H"))

Multiple comparison correction and View results

> topTags(de.tagwise)
Comparison of groups:  H-M 
             logFC    logCPM       PValue          FDR
AT2G19190 4.548463  7.389103 1.624163e-21 4.258718e-17
AT4G12500 4.391107 10.436948 1.998598e-20 2.451996e-16
AT2G44370 5.435565  5.212365 3.216775e-20 2.451996e-16
AT3G46280 4.925036  8.124281 4.407698e-20 2.451996e-16
AT5G48430 6.253282  6.720990 4.675633e-20 2.451996e-16
AT3G55150 5.779057  4.915687 7.776623e-20 3.398514e-16
AT2G39380 4.949922  5.787404 6.116835e-19 2.250632e-15
AT1G51820 4.387358  6.384947 6.866655e-19 2.250632e-15
AT4G12490 3.877082 10.199983 1.348539e-18 3.928894e-15
AT5G64120 3.755055  9.700967 1.557509e-18 4.083943e-15

(注：edgeRのバージョンで微妙に結果は変わる。上記は、edgeR ver. 3.34.1 での結果)

Dump the table into a text file

> write.table(de.tagwise$table, "de.tagwise.txt", sep="\t", quote=F)

もしくは、

> de.tagwise.sorted <- topTags(de.tagwise, n=nrow(de.tagwise$table))
> write.table(de.tagwise.sorted$table, "de.tagwise2.txt", sep="\t", quote=F)

後者はFDRの値も出力されるのでオススメ。

適切にファイルにデータが出力されたか確認する。MacやWindowsでは作業していたフォルダを開いて、de.tagwise2.txtというファイルができているかをチェックする。エディタや表計算ソフトウェアでオープンして中身を確認することもできる。UNIXコマンドラインからは、less de.tagwise2.txt などのコマンドで中身を確認することができる。

MA plot　を描いてみよう。

edgeR提供のplotSmear関数を使うと便利。

plotSmear(D)

有意に発現差のある遺伝子を赤色でハイライトすることもできる。

> de.names <- row.names(dat[decideTestsDGE(de.tagwise, p.value=0.05) !=0, ])
> plotSmear(D, de.tags=de.names)

Inspect DE result

example: fold-change > 10を抽出・カウント

## 準備
> de.tagwise.sorted <- topTags(de.tagwise, n=nrow(de.tagwise$table))
> detab <- de.tagwise.sorted$table
## get fold-change > 10
> detab[detab$logFC > log2(10),]
# => fold-change > 10のものが出力される。数だけ知りたい時は、
> nrow(detab[detab$logFC > log2(10),])
[1] 1062

example: FC > 5 AND FDR < 0.01

> detab[(detab$logFC > log2(5)  & detab$FDR < 0.01), ]

=> 127

edgeRに組み込まれている''decideTestDGE''関数も便利。上と同じ条件を満たす遺伝子を抽出するには、

> summary(decideTestsDGE(de.tagwise, p.value=0.01, lfc=log2(5)))
         H-M
Down      11
NotSig 26083
Up       127

dumpしたタブ区切りテキストをMS Excelで読み込んで、フィルタ機能やソート機能を使ってデータを探索するのも良いだろう。

Revision history

2021-0-11

Minor updates.

2021-3-9

Minor updates. R ver.: 4.0.4; edgeR ver.: 3.32.1