怎么进行统计学分析? - ricket-sjtu/bi028 GitHub Wiki

统计学分析方法的选择

应变量个数 自变量个数与类型 应变量类型 统计学方法 R函数
1 0 (单个总体) 区间或正态分布 单样本t检验 t.test()
有序或区间 Wilcoxon秩检验 wilcox.test()
二分类 二项式检验 binom.test(), prop.test()
多分类 卡方检验 chisq.test()
1 (单因素两独立水平) 区间或正态分布 独立样本t检验 t.test(.., paired=F)
有序或区间 Wilcoxon Mann-Whitney检验 wilcox.test(.., paired=F)
分类数据 卡方检验或Fisher精确检验 chisq.test(), fisher.test()
1 (单因素多独立水平) 区间或正态分布 One-way ANOVA anova()
有序或区间 Kruskal-Wallis检验 kruskal.test()
分类数据 卡方检验 chisq.test()
1 (单因素两配对水平) 区间或正态分布 配对t检验 t.test(.., paired=T)
有序或区间 Wilcoxon符号秩检验 wilcox.test(.., paired=T)
分类数据 McNemar检验 mcnemar.test()
1 (单因素配对多水平) 区间或正态分布 One-way repeated measure ANOVA car::Anova()
有序或区间 Friedman检验 friedman.test(.., paired=T)
分类数据 Repeated Measured Logistic回归 lme4::glmer(.., family=binomial)
2+ (多因素独立) 区间或正态分布 factorial ANOVA car::Anova()
有序或区间 ordered Logistic回归 MASS::polr()
分类数据 factorial Logistic回归 car::Anova()
1 (区间或连续) 区间或正态分布 相关分析或简单回归分析 cor(.., method="pearson"), lm()
有序或区间 非参数相关 cor(.., method="spearman")
分类数据 简单Logistic回归 glm(.., family=binomial)
2+ (连续或分类变量) 区间或正态分布 多重线性回归,协方差分析 lm(), ancova()
分类 多重Logisitic回归,判别分析 nnet::multinom(), MASS::lda()
2+ 单因素多水平 区间或正态连续 MANOVA分析 manova()
2+ 2+变量 区间或正态连续 多元线性回归 manova(lm), car::Manova(lm)
2 sets of 2+ 0 区间或正态连续 典型相关分析 cancor()
2+ 0 区间或正态连续 因子分析 factanal()

多元多重线性回归

my.model <- lm(cbind(A, B) ~ c + d + e + f + g + H + I)
## uses sequential model comparisons for so-called Type I sum of squares
summary(manova(my.model))
## uses model comparisons for Type II sum of squares
library(car)
Manova(my.model, type="II")

Unbalanced experimental design

Not having equal number of observations in each of the strata.

Consider a model that includes two factors A and B:

  • there are therefore two main effects, and
  • an interaction, AB
  • full model: SS(A, B, AB)
  • model without interaction: SS(A, B)
  • SS(AB | A, B) = SS(A, B, AB) - SS(A, B)
  • SS(A | B, AB) = SS(A, B, AB) - SS(B, AB)
  • SS(B | A, AB) = SS(A, B, AB) - SS(A, AB)
  • SS(A | B) = SS(A, B) - SS(B)
  • SS(B | A) = SS(A, B) - SS(A)

Type I Sum of Squares (SS)

sequential sum of squares

  • SS(A)
  • SS(B | A)
  • SS(AB | A, B)

Type II Sum of Squares (SS)

  • SS(A | B)
  • SS(B | A)

Type III Sum of Squares (SS)

  • SS(A | B, AB)
  • SS(B | A, AB)

Notation

When data is balanced, the factors are orthogonal, and types I, II and III all give the same results.

Demo through simulation

N <- 100                             # number of subjects
c <- rbinom(N, 1, 0.2)               # dichotomous predictor c
H <- rnorm(N, -10, 2)                # continuous predictor H
A <- -1.4*c + 0.6*H + rnorm(N, 0, 3) # Dependent Variable A
B <-  1.4*c - 0.6*H + rnorm(N, 0, 3) # Dependent Variable B
Y <- cbind(A, B)                     # Dependent Variable matrix
my.model <- lm(Y ~ c + H)            # the multivariate model
summary(manova(my.model))            # base-R: SS type I
#           Df  Pillai approx F num Df den Df  Pr(>F)    
# c          1 0.06835   3.5213      2     96 0.03344 *  
# H          1 0.32664  23.2842      2     96 5.7e-09 ***
# Residuals 97
library(car)                           # for Manova()
Manova(my.model, type="II")
# Type II MANOVA Tests: Pillai test statistic
#   Df test stat approx F num Df den Df  Pr(>F)    
# c  1   0.05904   3.0119      2     96 0.05387 .  
# H  1   0.32664  23.2842      2     96 5.7e-09 ***

所以,我们可以看到:

  • 对于stats::manova,使用的是Type I SS,也就是先后计算SS(c)SS(H | c)
  • 对于car::Manova,使用的是Type II SS,也就是先后计算SS(c | H)SS(H | c)
  • stats::anovacar::Anova的区别也是相同的。

逐步解释模型

  1. Build the design matrix $X$
X  <- cbind(1, c, H)
XR <- model.matrix(~ c + H)
all.equal(X, XR, check.attributes=FALSE)
  1. Define the orthogonal projection for the full model: $P_f = X(X^T X)^{-1} X^T$, and then get the matrix $W = Y^T (I - P_f) Y$:
Pf  <- X %*% solve(t(X) %*% X) %*% t(X)
Id  <- diag(N)
WW  <- t(Y) %*% (Id - Pf) %*% Y
  1. Type I SS for restricted and unrestricted models, plus their projections $P_{rI}$ and $P_{uI}$, gives the matrix $B_I = Y^T (P_{uI} - P_{rI}) Y$
XrI <- X[, 1]
PrI <- XrI %*% solve(t(XrI) %*% XrI) %*% t(XrI)
XuI <- X[, c(1,2)]
PuI <- XuI %*% solve(t(XuI) %*% XuI) %*% t(XuI)
Bi <- t(Y) %*% (PuI - PrI) %*% Y
  1. Type II SS for restricted and unrestricted models, plus their projections $P_{rII}$ and $P_{uII}$, leading to the matrix $B_{II} = Y^T (P_{uII} - P_{rII}) Y$:
XrII <- X[, -2]
PrII <- XrII %*% solve(t(XrII) %*% XrII) %*% t(XrII)
PuII <- Pf
Bii  <- t(Y) %*% (PuII - PrII) %*% Y
  1. Pillai-Bartlett trace for both types of SS: $\operatorname{tr}((B+W)^{-1}B)$
(PBTi  <- sum(diag(solve(Bi  + WW) %*% Bi)))   # SS type I
# [1] 0.0683467
(PBTii <- sum(diag(solve(Bii + WW) %*% Bii)))  # SS type II
# [1] 0.05904288
⚠️ **GitHub.com Fallback** ⚠️