ex3 - nibb-unix/gitc202502-unix GitHub Wiki

復習問題３　統計学入門

1. 以下のシミュレートしたデータを使って、線形モデルのあてはめを行ってみる。

generate.simulated.data <- function(sigma){
  wtValue         <- 3
  strain1Value    <- 5
  strain2Value    <- 7
  replicate1value <- 2
  replicate2value <- 5
  replicate3value <- 1
  replicateValues <- c(replicate1value, replicate2value, replicate3value)
  residualValues  <- rnorm(9, 0, sigma)
  
  dataMatrix <- cbind(
    "genotype"=rep(c(wtValue, strain1Value, strain2Value), each=3),
    "replicate"=rep(replicateValues, 3),
    "residual"=residualValues
  )
  
  dataMatrix <- cbind(dataMatrix, "observed.value"=rowSums(dataMatrix))
  
  dataFrame <- data.frame(
    cbind("strain"=rep(c("control", "strain1", "strain2"), each=3), 
          "rep"=rep(1:3, 3), 
          "value"=dataMatrix[,"observed.value"])
  )
  
  dataFrame$value <- as.numeric(as.vector(dataFrame$value))
  list("dataMatrix"=dataMatrix, "dataFrame"=dataFrame)
}

simulatedData <- generate.simulated.data(0.5)
dataMatrix    <- simulatedData$dataMatrix
dataFrame     <- simulatedData$dataFrame

問1-1. 生成されたデータdataMatrixの内容を確認し、barplot関数でstrain、repおよび残差で観察値observed.valueを構成しているかを確認しなさい（ヒント：行列の転置が必要です）。
問1-2.dataFrame、lm関数を使ってstrainを説明変数、valueを応答変数とする線形モデルのあてはめを行い、推定された係数を確認しなさい。
問1-3. 同様にstrainとrepを説明変数とする線形モデルのあてはめを行い、推定された係数を確認しなさい。

2. t検定における帰無仮説とp値をRを使って確認してみよう。

以下の実験を考えてみる。

ショウジョウバエをバイアルに200匹準備し、その中から３匹を使って測定を行う。２つのバイアルの両方に野生型を準備し、平均値を比較した場合には有意な差は観察されないはずである。しかし、p値とは「帰無仮説（この場合は２本のバイアルのショウジョウバエからの測定値の平均に差がない）のもと、得られた統計量が観察される確率」であり、多くの場合、p値<0.05の場合に２群に有意な差があったと考えられている。同じ集団を比較することにより、帰無仮説の下、観察されるp値を実験してみる。

以下のRコードで帰無仮説のもと得られるp値を1000個を取得する。

prep.and.select.samples <- function(N, n){
  replicatePopulation <- rnorm(N, 0, 1)
  sample(replicatePopulation, n)
}

generate.null.t.test <- function(N, n){
  t.test(x=prep.and.select.samples(N, n), y=prep.and.select.samples(N, n), 
  var.equal=TRUE, lower.tail=FALSE)$p.value
}

nullPvalues <- sapply(1:1000, function(x) generate.null.t.test(200,3))

問2-1. nullPvaluesのヒストグラムを描き、この帰無仮説のもとのp値の分布を求めなさい。
問2-2. nullPvaluesのうち、0.05以下の検定はいくつあったか？　length, which等を使って確認しなさい。
問2-3. このシミュレーションを再度行った際には同様な結果が得られるか？確認しなさい。

3. 以下はシロイヌナズナのトランスクリプトームデータからコサイン係数（類似度）を求め、あるコサイン係数が統計的にも有意かを調べる例です。

(コサイン係数については実践編の「多変量解析」の資料を参照）相関係数が0ではないことを検定する方法は別にありますが、この例では「このデータが取りうるランダムな相関係数とは有意に異なるか」の２つの帰無仮説について調べます。

3-1: データ中のある遺伝子の発現プロファイルと他のある遺伝子の発現プロファイルのコサイン係数が、データ内の数値から偶然得られる値とは有意に異なるかを_p=0.05_を基準に評価する。

library(coop)

## function

# random sampling of values across samples:
# check whether or not observed correlation value can be observed in a random data
compute.null.cosine.gene <- function(row){
  x <- dat
  #rownames(x) <- colnames(x) <- NULL
 
  permutated <- x
  colnames(x) <- sample(colnames(x))

  permutated <- rbind(
    x[row,], permutated
  )
  tcosine(permutated)[row,-row]
}

# random sampling of correlations to collect correlation of random gene pairs:
# get distribution of correlation for random gene pairs to check a correlation 
# of interest is particularly high or low in the data
compute.null.cosine.genes <- function(alpha){
  rhoMatrix <- tcosine(dat)
  rhoDist   <- as.dist(rhoMatrix)

  nullCosines <- sample(rhoDist, nrow(dat), replace=TRUE)
  quantile(nullCosines, c(alpha/2, 1-alpha/2))
}


## main
dat <- read.delim("~/gitc/data/3_statistics/Sato_A_thaliana-P_syringae_arvRpt2_6h_expRatio_full.txt", header=TRUE, row.names=1)

nullCosineMatrixForGene  <- sapply(1:1000, function(x) compute.null.cosine.gene(1))

obs <- tcosine(dat)[1,-1]

RhoThreshold  <- round(quantile(nullCosineMatrixForGene, 0.975), 3)

par(mfrow=c(2,1))
par(mar=c(4,4,1,1))
histData   <- hist(
  obs, 
  xlim=c(-1,1),
  main="", 
  xlab=expression(paste("Observed correlations (", italic(R), ")"))
)
abline(v=RhoThreshold, col="magenta")
labelX <- RhoThreshold
labelString <- as.character(RhoThreshold)
text(x=RhoThreshold+0.05, y=max(histData$count), 
     label=expression(paste(italic("R")[p=0.05])),
     adj=c(1,1))
par(mar=c(5,4,1,1))
hist(nullCosineMatrixForGene,
  xlim=c(-1,1),
  main="", 
  xlab=expression(paste("Null distribution (", italic(R), ")"))
)
abline(v=RhoThreshold, col="magenta")

3-2.データ中のある遺伝子の発現プロファイルと他のある遺伝子の発現プロファイルのコサイン係数が、データ内の全ての遺伝子ペアから偶然得られる値とは有意に異なるかを_p=0.05_を基準に評価するための閾値となるコサイン係数を求める。

## function
# random sampling of correlations to collect correlation of random gene pairs:
# get distribution of correlation for random gene pairs to check a correlation 
# of interest is particularly high or low in the data
compute.null.cosine.genes <- function(alpha){
  rhoMatrix <- tcosine(dat)
  rhoDist   <- as.dist(rhoMatrix)

  nullCosines <- sample(rhoDist, nrow(dat), replace=TRUE)
  quantile(nullCosines, c(alpha/2, 1-alpha/2))
}

### main
### 上記3-1から継続
nullCosineMatrixForGenes <- sapply(1:1000, function(x) compute.null.cosine.genes(0.05))
RhoThresholdAllGenes     <- rowMeans(nullCosineMatrixForGenes)

cat("Null correlation values at p-value in two-sided:", 
    RhoThresholdAllGenes, sep="\n")

ここで得られる両側検定の値は毎回異なるが目安として-0.8587046, 0.8947758程度の値が得られる。