Missing value processing - ricket-sjtu/bi028 GitHub Wiki

Objectives

    1. Understand what <NA> is
    1. Learn the ideas and methods behind missing data imputation
    1. Learn how to choose an imputation method

What is missing data

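The data we need to process and analyze often contain missing values. In R, a missing value is represented by NA. Note that NaN is not a missing value but "not a number" (although is.na() returns TRUE for NaN as well). Computations on NA can sometimes produce NaN too, for example mean(NA, na.rm=TRUE).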

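  • When reading data with read.table() and similar functions, use the na.strings argument (default "NA") to specify which strings stand for missing values;
  • The usual way of handling NA is to leave the missing values out of the computation, which is what (.., na.rm=TRUE) does;
  • Alternatively, select only the complete observations with stats::complete.cases(): subset(airquality, complete.cases(airquality));
  • or drop the incomplete rows with stats::na.omit(airquality);
  • base::is.na() tests each value individually, e.g. sum(is.na(airquality$Ozone)) counts the missing values of one variable, and colSums(is.na(airquality)) counts them for every column.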
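The points above can be tried out with base R alone. The following is a minimal sketch; the sentinel string "-999" and the small inline CSV are made up for illustration and do not come from any real data set.

```r
data(airquality)

# na.strings: treat both "NA" and the hypothetical sentinel "-999" as missing
df <- read.csv(text = "a,b\n1,-999\n2,5", na.strings = c("NA", "-999"))
df$b                          # first entry is NA

# na.rm = TRUE leaves missing values out of the computation
x <- c(1, NA, 3)
mean(x)                       # NA: missing values propagate
mean(x, na.rm = TRUE)         # mean of the observed values only

# NA versus NaN
is.na(NaN)                    # TRUE: is.na() also flags NaN
is.nan(NA)                    # FALSE: NA is not NaN
all_na_mean <- mean(c(NA_real_), na.rm = TRUE)  # NaN: nothing left to average

# count missing values per column
na_counts <- colSums(is.na(airquality))
na_counts

# two equivalent ways of keeping only the complete rows
cc <- subset(airquality, complete.cases(airquality))
no <- na.omit(airquality)
nrow(cc) == nrow(no)
```

Both complete.cases() and na.omit() discard whole rows, so they can throw away a lot of observed data when missing values are spread over many rows; this is the motivation for imputation.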

Basic imputation methods

  • Replace with 0
  • Replace with the mean
  • Replace with the median
  • Replace with the mode
  • Replace with the minimum or maximum
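The basic strategies listed above could be sketched in base R as follows. The helper names stat_mode() and impute_simple() are hypothetical and written only for this illustration; base R has no built-in function for the statistical mode, so one is defined here.

```r
# Hypothetical helper: most frequent non-missing value (first one wins on ties)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: fill every NA in x with a single summary value
impute_simple <- function(x, method = c("zero", "mean", "median",
                                        "mode", "min", "max")) {
  method <- match.arg(method)
  fill <- switch(method,
    zero   = 0,
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    mode   = stat_mode(x),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
  )
  x[is.na(x)] <- fill
  x
}

x <- c(2, NA, 4, 4, NA, 9)
impute_simple(x, "zero")
impute_simple(x, "mean")
impute_simple(x, "median")
impute_simple(x, "mode")
```

All of these fill every gap in a variable with the same value, which shrinks its variance and ignores relationships with other variables. The model-based methods discussed next try to address this.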

Other methods

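  • mice: Multivariate Imputation by Chained Equations. It assumes the data are missing at random (MAR), meaning that the probability of a value being missing depends only on the other observed values, so the missing values can be estimated by prediction. It is a parametric approach that imputes each incomplete variable with its own regression model or other method, chosen by variable type:
    • pmm (Predictive Mean Matching): for numeric variables
    • logreg (logistic regression): for dichotomous variables (2 levels)
    • polyreg (polytomous regression): for unordered factor variables (more than 2 levels by default)
    • polr (proportional odds model): for ordered factor variables (more than 2 levels by default)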
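  • missForest: imputation based on random forests. It is a nonparametric method that can be applied to different variable types, including mixed numeric and categorical data. Unlike mice and Amelia, it returns a single imputed data set rather than several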
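  • Hmisc
    • impute(): fills in missing values with a user-chosen statistic such as the mean, median, or mode
    • aregImpute(): uses additive regression, bootstrapping, and predictive mean matching
    • aregImpute() assumes linearity in the variables being predicted;
    • Fisher's optimum scoring method can be used to predict missing categorical variables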
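  • Amelia
    • uses multiple imputation to produce more robust estimates and reduce bias
    • estimation is based on the EMB algorithm (expectation-maximization with bootstrapping)
    • can be used with cross-sectional data as well as time-series data
    • unlike MICE, the statistical model behind it is the multivariate normal (MVN) distribution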
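To make the idea of predictive mean matching (PMM) mentioned above more concrete, here is a simplified base-R sketch for a single numeric variable. The function pmm_impute() and its argument k are hypothetical names for this illustration. The sketch assumes the predictors are fully observed, and it is not the implementation used by mice, which additionally accounts for the uncertainty of the regression parameters.

```r
# Simplified PMM sketch (hypothetical helper, not the mice implementation).
# Assumption: all predictors on the right-hand side of `formula` are complete.
pmm_impute <- function(data, formula, k = 5) {
  target <- all.vars(formula)[1]            # name of the variable to impute
  miss <- is.na(data[[target]])
  # 1. fit a regression on the rows where the target is observed
  fit <- lm(formula, data = data[!miss, , drop = FALSE])
  # 2. predict the mean for every row, observed or not
  pred <- predict(fit, newdata = data)
  obs_idx <- which(!miss)
  for (i in which(miss)) {
    # 3. donors: the k observed rows whose predictions are closest to row i
    d <- abs(pred[obs_idx] - pred[i])
    donors <- obs_idx[order(d)[seq_len(min(k, length(obs_idx)))]]
    # 4. copy the observed value of one randomly chosen donor
    pick <- donors[sample.int(length(donors), 1)]
    data[[target]][i] <- data[[target]][pick]
  }
  data
}

set.seed(1)
aq <- pmm_impute(airquality, Ozone ~ Wind + Temp, k = 5)
sum(is.na(aq$Ozone))   # no missing Ozone values remain
```

Because every imputed value is copied from an observed donor, the imputations always stay within the range of the observed data, which is one reason PMM is a common choice for numeric variables.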

Example 1

library(missForest)
library(mice)
data(airquality)

summary(airquality)
## generate missing data
airquality.mis <- prodNA(airquality, noNA=.06) # proportion of NA in all the entries of the data
summary(airquality.mis)
## drop columns 5 and 6 (Month and Day)
airquality.mis <- subset(airquality.mis, select=-c(5,6))
summary(airquality.mis)

## check the pattern of missing
mice::md.pattern(airquality.mis)

## plot the pattern
if (!requireNamespace("VIM", quietly=TRUE)) install.packages("VIM")  # install only if missing
library(VIM)
airquality_naplot <- aggr(airquality.mis, col=c("navyblue", "yellow"),
            numbers=TRUE, sortVars=TRUE,
            labels=names(airquality.mis), 
            gap=3, ylab=c("Missing Data", "Missing Pattern"))

## impute the missing values using predictive mean matching (PMM), generates 5 datasets
airquality.imputed <- mice(data=airquality.mis, m=5, method="pmm", 
                           maxit=50, seed=500)

## the imputed values are stored in $imp, a list with one element per variable
airquality.imputed$imp
airquality.imputed$imp$Ozone  # data.frame with 5 columns, one per imputation


## select one of the completed data sets (here the 2nd imputation)
airquality.complete <- mice::complete(airquality.imputed, 2)
dim(airquality.complete)

Example 2

library(Amelia)
library(missForest)
data(iris)
iris.mis <- prodNA(iris, noNA=.1)
## noms marks Species as a nominal variable; parallel="multicore" is not available on Windows
iris.amelia <- amelia(iris.mis, m=5, parallel="multicore", 
                    noms="Species", idvars=c())
iris.amelia$imputations[[1]] # first imputed data set

## write the results as 1.csv, 2.csv, ..., 5.csv
write.amelia(iris.amelia, file.stem="./")

## plot the distributions for both observed and imputed variables
plot(iris.amelia)