Data preprocessing - ricket-sjtu/bi028 GitHub Wiki
Data types
- Quantitative (numeric)
- Qualitative (categorical)
  - Nominal (unordered)
  - Ordinal (ordered)
- Censored data: time-to-event, loss to follow-up, right-censored
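
To make the censored data type concrete, here is a minimal sketch of how a right-censored time-to-event observation is commonly encoded: an observed time plus an event indicator. The names `Observation` and `observe` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    time: float  # observed follow-up time
    event: bool  # True if the event was observed, False if right-censored


def observe(event_time: Optional[float], followup_end: float) -> Observation:
    """Return the observed (time, event) pair for one subject.

    event_time is the true event time (None if the event never occurs);
    followup_end is when the subject was lost to follow-up or the study ended.
    """
    if event_time is not None and event_time <= followup_end:
        return Observation(event_time, True)
    # The event was not seen before follow-up ended: right-censored.
    return Observation(followup_end, False)


subjects = [
    observe(5.0, 10.0),   # event observed at t = 5
    observe(12.0, 10.0),  # event would occur after follow-up ended
    observe(None, 7.5),   # lost to follow-up at t = 7.5
]
```

For a censored subject the recorded time is only a lower bound on the true event time, which is why such data cannot be treated as ordinary numeric values.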
Data transformation
- Log transformation
- Square transformation
- Square-root transformation
- Arctan transformation
- Others: e.g. the ilr, clr, and alr transformations for compositional data
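
The transformations listed above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not code from the course: the function names are assumptions, and only clr and alr are shown for compositional data (ilr needs an orthonormal basis and is omitted).

```python
import math


def log_transform(x):
    """Natural log of each value; every value must be > 0."""
    return [math.log(v) for v in x]


def square_transform(x):
    return [v * v for v in x]


def sqrt_transform(x):
    """Square root of each value; every value must be >= 0."""
    return [math.sqrt(v) for v in x]


def arctan_transform(x):
    """Maps any real value into the open interval (-pi/2, pi/2)."""
    return [math.atan(v) for v in x]


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def alr(parts):
    """Additive log-ratio, using the last part as the reference."""
    ref = parts[-1]
    return [math.log(p / ref) for p in parts[:-1]]
```

The clr values of a composition always sum to zero (up to rounding), while alr returns one value fewer than the number of parts.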
Denoising
Missing-value imputation
dprep::clean(): data cleaning; removes the rows or columns in which the proportion of missing values exceeds a given threshold.
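
The idea of dropping rows or columns with too many missing values can be sketched as follows. This is not a port of `dprep::clean()`: the function `drop_sparse`, its `max_missing` parameter, the use of `None` for missing values, and the columns-then-rows order are all assumptions made for illustration.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse(rows, max_missing=0.5):
    """Drop columns, then rows, whose missing fraction exceeds max_missing."""
    n_cols = len(rows[0])
    keep_cols = [
        j for j in range(n_cols)
        if missing_fraction([r[j] for r in rows]) <= max_missing
    ]
    if not keep_cols:
        return []
    reduced = [[r[j] for j in keep_cols] for r in rows]
    return [r for r in reduced if missing_fraction(r) <= max_missing]


data = [
    [1.0, None, 3.0],
    [2.0, None, None],
    [None, None, 1.0],
    [4.0, 5.0, 6.0],
]
cleaned = drop_sparse(data, max_missing=0.4)
```

Here the second column (75% missing) is removed first, after which the two rows that are still half missing are removed as well.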
dprep::ce.impute(): imputes missing values in the data; the available strategies are the mean (numeric data), the median (numeric data), KNN (k-nearest neighbors, numeric data), and the mode (categorical data).
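
The four strategies can be illustrated with a short standard-library sketch. It does not reproduce the `dprep::ce.impute()` interface: `impute` and `knn_impute` are hypothetical names, missing values are represented by `None`, and the KNN variant assumes that only the target column has gaps.

```python
import math
import statistics


def impute(column, strategy="mean"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


def knn_impute(rows, target, k=2):
    """Fill a missing rows[i][target] with the mean of that column over the
    k complete rows closest in the remaining columns (Euclidean distance)."""
    donors = [r for r in rows if all(v is not None for v in r)]
    result = []
    for row in rows:
        row = list(row)
        if row[target] is None:
            def distance(donor):
                return math.sqrt(sum(
                    (donor[j] - row[j]) ** 2
                    for j in range(len(row)) if j != target
                ))
            nearest = sorted(donors, key=distance)[:k]
            row[target] = statistics.mean(d[target] for d in nearest)
        result.append(row)
    return result


filled_mean = impute([1.0, None, 3.0], "mean")
filled_mode = impute(["a", "b", None, "a"], "mode")
filled_knn = knn_impute(
    [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]], target=1, k=2
)
```

Mean and median only make sense for numeric columns, whereas the mode also works for categorical values, which matches the split given in the description above.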
mice::mice(): imputes missing values across multiple columns of a dataset. The imputation methods include:
pmm: Predictive mean matching (numeric)
norm: Bayesian linear regression (numeric)
norm.nob: Linear regression by ignoring model error (numeric)
norm.boot: Linear regression using bootstrapping (numeric)
norm.predict: Linear regression, predicted values (numeric)
mean: Unconditional mean imputation (numeric)
2l.norm: Two-level normal imputation (numeric)
2l.pan: Two-level normal imputation using pan (numeric)
2lonly.mean: Imputation at level-2 of the class mean (numeric)
2lonly.norm: Imputation at level-2 by Bayesian linear regression (numeric)
2lonly.pmm: Imputation at level-2 by Predictive mean matching (any)
quadratic: Imputation of quadratic terms (numeric)
logreg: Logistic regression (binary)
logreg.boot: Logistic regression with bootstrapping (binary)
polyreg: Polytomous regression (unordered categorical, >2 categories)
polr: Proportional odds model (ordered categorical, >2 categories)
lda: Linear discriminant analysis (factor, >=2 categories)
cart: Classification and regression trees (any)
rf: Random forest imputations (any)
ri: Random indicator method for nonignorable data (numeric)
sample: Random sample from the observed values (any)
fastpmm: Fast predictive mean matching using C++ (any)
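
To give a feel for what a method such as `pmm` does, here is a simplified single-predictor sketch of predictive mean matching. It is not the mice implementation: the function `pmm_impute` and its parameters are hypothetical, and it uses a plain least-squares fit where mice draws the regression parameters from their posterior.

```python
import random


def pmm_impute(x, y, k=3, seed=0):
    """Impute missing y values (None) from one predictor x.

    For each missing case, predict y from x, find the k observed cases with
    the closest predicted values, and copy the observed y of a random donor.
    """
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed) / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    result = []
    for xi, yi in zip(x, y):
        if yi is None:
            target = predict(xi)
            donors = sorted(
                observed, key=lambda d: abs(predict(d[0]) - target)
            )[:k]
            yi = rng.choice(donors)[1]  # copy an actually observed value
        result.append(yi)
    return result


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, 9.8, 12.1]
completed = pmm_impute(x, y, k=2)
```

Because the imputed value is always copied from a donor, it stays within the range of the observed data, unlike a value taken directly from the regression line.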