Missing value processing - ricket-sjtu/bi028 GitHub Wiki
Objectives
- Understand what <NA> is
- Understand the ideas and methods of missing data imputation
- Learn how to choose an imputation method
What is missing data
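The data we need to process and analyze often contain missing values. In R, a missing value is represented by NA. Note that NaN is not a missing value; it stands for "not a number". That said, a computation that involves NA can sometimes produce NaN, for example mean(x, na.rm=TRUE) when every element of x is NA.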
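The difference between NA and NaN is easiest to see at the console. The following is a minimal base R sketch; the vectors `x` and `all_missing` are made up for illustration.

```r
x <- c(1, NA, 3)

# NA propagates through summaries unless it is removed first
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 2

# NaN comes from an undefined numeric operation
0 / 0                   # NaN

# Removing every NA leaves nothing to average, so the result is NaN
all_missing <- c(NA_real_, NA_real_)
mean(all_missing, na.rm = TRUE)   # NaN

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
is.na(c(NA, NaN))    # TRUE TRUE
is.nan(c(NA, NaN))   # FALSE TRUE
```

Because is.na() also flags NaN, code that needs to tell the two apart should use is.nan().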
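- When reading data with read.table() and related functions, use the na.strings argument (default "NA") to specify which strings in the file represent NA;
- The usual way to handle NA is to leave out the values that are missing, which is what (.., na.rm=TRUE) does;
- Alternatively, use stats::complete.cases() to select the complete observations: subset(airquality, complete.cases(airquality)); stats::na.omit(airquality) returns the same rows;
- base::is.na() tests each element for missingness; applied to a single variable it can be used to count the missing values, e.g. sum(is.na(airquality$Ozone)).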
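The options listed above can be tried together in a short base R session. This is a sketch only: the inline text passed to read.table() and the "-999" sentinel are invented for illustration, while airquality is the built-in data set used elsewhere on this page.

```r
# Declare which strings in the raw input stand for missing values.
# The inline text and the "-999" sentinel are hypothetical.
raw <- "id value\n1 3.5\n2 NA\n3 -999\n"
df <- read.table(text = raw, header = TRUE, na.strings = c("NA", "-999"))
sum(is.na(df$value))   # both "NA" and "-999" were read as NA

data(airquality)

# is.na() on a data frame returns a logical matrix, so colSums() counts NAs per column
na_per_column <- colSums(is.na(airquality))
na_per_column

# Keep only the rows without any missing value (two equivalent ways)
complete_rows <- subset(airquality, complete.cases(airquality))
omitted <- na.omit(airquality)
nrow(complete_rows) == nrow(omitted)

# Skip NA inside a single summary instead of dropping whole rows
mean_ozone <- mean(airquality$Ozone, na.rm = TRUE)
```

Dropping rows discards the observed values in those rows as well, which is one reason to consider imputation instead.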
Basic imputation methods
- Replace with 0
- Replace with the mean
- Replace with the median
- Replace with the mode
- Replace with the minimum or maximum
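Each of these single-value replacements can be written in a few lines of base R. The helper below is a hypothetical illustration (the name `impute_simple` and its `method` argument are not part of any package); it computes the fill value from the observed entries only.

```r
# Hypothetical helper: fill the NAs of a vector with one summary value
impute_simple <- function(x, method = c("mean", "median", "mode", "zero", "min", "max")) {
  method <- match.arg(method)
  observed <- x[!is.na(x)]
  fill <- switch(method,
    mean   = mean(observed),
    median = median(observed),
    mode   = {
      # most frequent observed value; ties go to the first one encountered
      ux <- unique(observed)
      ux[which.max(tabulate(match(observed, ux)))]
    },
    zero   = 0,
    min    = min(observed),
    max    = max(observed)
  )
  x[is.na(x)] <- fill
  x
}

x <- c(1, 2, 2, NA, 7)
impute_simple(x, "mean")     # the NA becomes 3
impute_simple(x, "median")   # the NA becomes 2
impute_simple(x, "mode")     # the NA becomes 2

# Applied to a real column
ozone_filled <- impute_simple(airquality$Ozone, "median")
```

These replacements are simple, but they shrink the variance of the variable and ignore its relationship with the other variables, which is what the model-based methods try to address.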
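Other methods
- mice: multivariate imputation via chained equations. It assumes the data are missing at random (MAR), that is, the probability that a value is missing depends only on the other observed values, so the missing value can be estimated by prediction. This is a parametric approach that uses a different regression (or other) model depending on the type of the variable being imputed:
  - pmm (Predictive Mean Matching) – for numeric variables
  - logreg (Logistic Regression) – for dichotomous variables (2 levels)
  - polyreg (polytomous logistic regression) – for unordered factor variables (> 2 levels)
  - polr (Proportional odds model) – for ordered factor variables (> 2 levels)
- missForest: imputation based on random forest. This is a nonparametric method that can be applied to different variable types; it returns a single imputed data set.
- Hmisc
  - impute(): uses a user-specified statistic, such as the mean, median or mode
  - aregImpute(): uses additive regression, bootstrapping and predictive mean matching
    - assumes a linear model;
    - Fisher's optimum scoring method can be used to predict missing categorical variables
- Amelia
  - uses multiple imputation to make estimates more robust and to reduce bias
  - based on a bootstrapped EM algorithm (expectation-maximization with bootstrapping, EMB)
  - can be used for cross-sectional data as well as time-series data
  - unlike MICE, the underlying statistical assumption is a multivariate normal (MVN) distribution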
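The missForest and Hmisc entries above have no worked case on this page, so here is a minimal sketch of how they are typically called. It assumes both packages are installed; the object names (iris.mis, iris.rf, sepal.filled, iris.areg) and the seed are chosen for illustration only.

```r
library(missForest)  # missForest(), prodNA()
library(Hmisc)       # impute(), aregImpute()

set.seed(1)
iris.mis <- prodNA(iris, noNA = 0.1)   # put NAs in about 10% of the entries

## missForest: one completed data set in $ximp, out-of-bag error in $OOBerror
iris.rf <- missForest(iris.mis)
head(iris.rf$ximp)
iris.rf$OOBerror   # NRMSE for numeric columns, PFC for factors

## Hmisc::impute(): fill one variable with a statistic of its observed values
sepal.filled <- with(iris.mis, impute(Sepal.Length, fun = mean))

## Hmisc::aregImpute(): multiple imputation, here with 5 imputations
iris.areg <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length +
                          Petal.Width + Species,
                        data = iris.mis, n.impute = 5)
iris.areg$imputed$Sepal.Length   # one column per imputation
```

Because the original iris values are known here, the imputed values in iris.rf$ximp can be compared against iris to get a feel for the error.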
Case 1
library(missForest)  # provides prodNA()
library(mice)
data(airquality)
summary(airquality)
## generate missing data
set.seed(500)  # make the randomly placed NAs reproducible
airquality.mis <- prodNA(airquality, noNA=.06) # proportion of NA in all the entries of the data
summary(airquality.mis)
## drop columns 5 and 6 (Month and Day)
airquality.mis <- subset(airquality.mis, select=-c(5,6))
summary(airquality.mis)
## check the pattern of missing
mice::md.pattern(airquality.mis)
## plot the pattern (install VIM only if it is not available yet)
if (!requireNamespace("VIM", quietly=TRUE)) install.packages("VIM")
library(VIM)
airquality_naplot <- aggr(airquality.mis, col=c("navyblue", "yellow"),
                          numbers=TRUE, sortVars=TRUE,
                          labels=names(airquality.mis),
                          gap=3, ylab=c("Missing Data", "Missing Pattern"))
## impute the missing values using predictive mean matching (PMM), generates 5 datasets
airquality.imputed <- mice(data=airquality.mis, m=5, method="pmm",
                           maxit=50, seed=500)
## imputed values are stored in $imp, a list with one element per variable
airquality.imputed$imp
airquality.imputed$imp$Ozone # data.frame with 5 columns, one per imputation
## select one of the complete imputed data sets (here the second)
airquality.complete <- mice::complete(airquality.imputed, 2)
dim(airquality.complete)
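Case 1 ends by picking one of the five completed data sets. When the goal is to fit a model, a common alternative is to fit it on every imputed data set and combine the results with mice's with() and pool(). The sketch below is self-contained and works on the NAs already present in airquality; the regression formula and the seed are chosen only for illustration.

```r
library(mice)

data(airquality)
# Impute the first four columns (Ozone, Solar.R, Wind, Temp) five times with PMM
imp <- mice(airquality[, 1:4], m = 5, method = "pmm",
            seed = 500, printFlag = FALSE)

# Fit the same linear model on each of the 5 completed data sets
fit <- with(imp, lm(Ozone ~ Solar.R + Wind + Temp))

# Combine the 5 sets of estimates into one (Rubin's rules)
pooled <- pool(fit)
summary(pooled)
```

The pooled standard errors reflect the uncertainty caused by the missing values, which a single completed data set does not show.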
Case 2
library(Amelia)
library(missForest)  # provides prodNA()
data(iris)
iris.mis <- prodNA(iris, noNA=.1)
## Species is a nominal variable, so it is passed via noms
## (parallel="multicore" is not available on Windows)
iris.amelia <- amelia(iris.mis, m=5, parallel="multicore",
                      noms="Species", idvars=c())
iris.amelia$imputations[[1]] # first imputed data set
## write the results as 1.csv, 2.csv, ..., 5.csv
write.amelia(iris.amelia, file.stem="./")
## plot the distributions for both observed and imputed variables
plot(iris.amelia)