怎么进行统计学分析？ - ricket-sjtu/bi028 GitHub Wiki

统计学分析方法的选择

应变量个数	自变量个数与类型	应变量类型	统计学方法	R函数
1	0 (单个总体)	区间或正态分布	单样本t检验	`t.test()`
		有序或区间	Wilcoxon秩检验	`wilcox.test()`
		二分类	二项式检验	`binom.test(), prop.test()`
		多分类	卡方检验	`chisq.test()`
	1 (单因素两独立水平)	区间或正态分布	独立样本t检验	`t.test(.., paired=F)`
		有序或区间	Wilcoxon Mann-Whitney检验	`wilcox.test(.., paired=F)`
		分类数据	卡方检验或Fisher精确检验	`chisq.test(), fisher.test()`
	1 (单因素多独立水平)	区间或正态分布	One-way ANOVA	`anova()`
		有序或区间	Kruskal-Wallis检验	`kruskal.test()`
		分类数据	卡方检验	`chisq.test()`
	1 (单因素两配对水平)	区间或正态分布	配对t检验	`t.test(.., paired=T)`
		有序或区间	Wilcoxon符号秩检验	`wilcox.test(.., paired=T)`
		分类数据	McNemar检验	`mcnemar.test()`
	1 (单因素配对多水平)	区间或正态分布	One-way repeated measure ANOVA	`car::Anova()`
		有序或区间	Friedman检验	`friedman.test(.., paired=T)`
		分类数据	Repeated Measured Logistic回归	`lme4::glmer(.., family=binomial)`
	2+ (多因素独立)	区间或正态分布	factorial ANOVA	`car::Anova()`
		有序或区间	ordered Logistic回归	`MASS::polr()`
		分类数据	factorial Logistic回归	`car::Anova()`
	1 (区间或连续)	区间或正态分布	相关分析或简单回归分析	`cor(.., method="pearson"), lm()`
		有序或区间	非参数相关	`cor(.., method="spearman")`
		分类数据	简单Logistic回归	`glm(.., family=binomial)`
	2+ (连续或分类变量)	区间或正态分布	多重线性回归，协方差分析	`lm(), ancova()`
	2+ (连续或分类变量)	分类	多重Logisitic回归，判别分析	`nnet::multinom(), MASS::lda()`
2+	单因素多水平	区间或正态连续	MANOVA分析	`manova()`
2+	2+变量	区间或正态连续	多元线性回归	`manova(lm), car::Manova(lm)`
2 sets of 2+	0	区间或正态连续	典型相关分析	`cancor()`
2+	0	区间或正态连续	因子分析	`factanal()`

多元多重线性回归

my.model <- lm(cbind(A, B) ~ c + d + e + f + g + H + I)
## uses sequential model comparisons for so-called Type I sum of squares
summary(manova(my.model))
## uses model comparisons for Type II sum of squares
library(car)
Manova(my.model, type="II")

Unbalanced experimental design

Not having equal number of observations in each of the strata.

Consider a model that includes two factors A and B:

there are therefore two main effects, and
an interaction, AB
full model: SS(A, B, AB)
model without interaction: SS(A, B)
SS(AB | A, B) = SS(A, B, AB) - SS(A, B)
SS(A | B, AB) = SS(A, B, AB) - SS(B, AB)
SS(B | A, AB) = SS(A, B, AB) - SS(A, AB)
SS(A | B) = SS(A, B) - SS(B)
SS(B | A) = SS(A, B) - SS(A)

Type I Sum of Squares (SS)

sequential sum of squares

SS(A)
SS(B | A)
SS(AB | A, B)

Type II Sum of Squares (SS)

SS(A | B)
SS(B | A)

Type III Sum of Squares (SS)

SS(A | B, AB)
SS(B | A, AB)

Notation

When data is balanced, the factors are orthogonal, and types I, II and III all give the same results.

Demo through simulation

N <- 100                             # number of subjects
c <- rbinom(N, 1, 0.2)               # dichotomous predictor c
H <- rnorm(N, -10, 2)                # continuous predictor H
A <- -1.4*c + 0.6*H + rnorm(N, 0, 3) # Dependent Variable A
B <-  1.4*c - 0.6*H + rnorm(N, 0, 3) # Dependent Variable B
Y <- cbind(A, B)                     # Dependent Variable matrix
my.model <- lm(Y ~ c + H)            # the multivariate model
summary(manova(my.model))            # base-R: SS type I
#           Df  Pillai approx F num Df den Df  Pr(>F)    
# c          1 0.06835   3.5213      2     96 0.03344 *  
# H          1 0.32664  23.2842      2     96 5.7e-09 ***
# Residuals 97
library(car)                           # for Manova()
Manova(my.model, type="II")
# Type II MANOVA Tests: Pillai test statistic
#   Df test stat approx F num Df den Df  Pr(>F)    
# c  1   0.05904   3.0119      2     96 0.05387 .  
# H  1   0.32664  23.2842      2     96 5.7e-09 ***

所以，我们可以看到：

对于stats::manova，使用的是Type I SS，也就是先后计算SS(c)和SS(H | c)；
对于car::Manova，使用的是Type II SS，也就是先后计算SS(c | H)和SS(H | c)；
stats::anova和car::Anova的区别也是相同的。

逐步解释模型

Build the design matrix $X$

X  <- cbind(1, c, H)
XR <- model.matrix(~ c + H)
all.equal(X, XR, check.attributes=FALSE)

Define the orthogonal projection for the full model: $P_f = X(X^T X)^{-1} X^T$, and then get the matrix $W = Y^T (I - P_f) Y$:

Pf  <- X %*% solve(t(X) %*% X) %*% t(X)
Id  <- diag(N)
WW  <- t(Y) %*% (Id - Pf) %*% Y

Type I SS for restricted and unrestricted models, plus their projections $P_{rI}$ and $P_{uI}$, gives the matrix $B_I = Y^T (P_{uI} - P_{rI}) Y$：

XrI <- X[, 1]
PrI <- XrI %*% solve(t(XrI) %*% XrI) %*% t(XrI)
XuI <- X[, c(1,2)]
PuI <- XuI %*% solve(t(XuI) %*% XuI) %*% t(XuI)
Bi <- t(Y) %*% (PuI - PrI) %*% Y

Type II SS for restricted and unrestricted models, plus their projections $P_{rII}$ and $P_{uII}$, leading to the matrix $B_{II} = Y^T (P_{uII} - P_{rII}) Y$:

XrII <- X[, -2]
PrII <- XrII %*% solve(t(XrII) %*% XrII) %*% t(XrII)
PuII <- Pf
Bii  <- t(Y) %*% (PuII - PrII) %*% Y

Pillai-Bartlett trace for both types of SS: $\operatorname{tr}((B+W)^{-1}B)$

(PBTi  <- sum(diag(solve(Bi  + WW) %*% Bi)))   # SS type I
# [1] 0.0683467
(PBTii <- sum(diag(solve(Bii + WW) %*% Bii)))  # SS type II
# [1] 0.05904288

怎么进行统计学分析？ - ricket-sjtu/bi028 GitHub Wiki

统计学分析方法的选择

多元多重线性回归

Unbalanced experimental design

Type I Sum of Squares (SS)

Type II Sum of Squares (SS)

Type III Sum of Squares (SS)

Notation

Demo through simulation

逐步解释模型

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️