Data preprocessing - ricket-sjtu/bi028 GitHub Wiki
Data types
- Quantitative (numeric)
- Qualitative (categorical)
  - Nominal (unordered)
  - Ordinal (ordered)
- Censored data: time-to-event data, loss to follow-up, right-censoring
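As a minimal sketch of what right-censoring means for time-to-event data, the Python snippet below records each subject as a follow-up time plus an event flag. The `Observation` class, the `observe` helper, and the follow-up cutoff of 5.0 are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    time: float   # follow-up time actually recorded
    event: bool   # True if the event was observed, False if right-censored


def observe(true_event_time, followup_end):
    """Record one subject when follow-up stops at followup_end (hypothetical helper)."""
    if true_event_time <= followup_end:
        # The event happened while the subject was still under observation.
        return Observation(true_event_time, True)
    # The event, if any, happens after follow-up ends: we only know time > followup_end.
    return Observation(followup_end, False)


records = [observe(t, followup_end=5.0) for t in (2.0, 7.5, 5.0)]
for r in records:
    print(r)
```

For a censored record the stored time is a lower bound on the true event time, which is why such data cannot be treated as ordinary numeric values.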
Data transformation
- Log transformation
- Square transformation
- Square-root transformation
- Arctan transformation
- Others: e.g. the ilr, clr, and alr transformations for compositional data
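The transformations listed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the natural logarithm is assumed for the log transformation, and `alr` assumes the last part is used as the reference. The ilr transformation is omitted because it requires choosing an orthonormal basis.

```python
import math


def log_transform(xs):
    """Natural log; only defined for strictly positive values."""
    return [math.log(x) for x in xs]


def square_transform(xs):
    return [x * x for x in xs]


def sqrt_transform(xs):
    """Square root; only defined for non-negative values."""
    return [math.sqrt(x) for x in xs]


def arctan_transform(xs):
    """Maps any real value into the open interval (-pi/2, pi/2)."""
    return [math.atan(x) for x in xs]


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def alr(parts):
    """Additive log-ratio, with the last part as reference (an assumption)."""
    ref = math.log(parts[-1])
    return [math.log(p) - ref for p in parts[:-1]]


values = [1.0, 10.0, 100.0]
composition = [0.2, 0.3, 0.5]  # parts of a whole, all strictly positive

print(log_transform(values))
print(clr(composition))
print(alr(composition))
```

The clr coordinates always sum to zero, and alr returns one fewer coordinate than the number of parts.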
Denoising
Missing-value imputation
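dprep::clean()
: Data cleaning: removes rows or columns whose proportion of missing values exceeds a given threshold.
dprep::ce.impute()
: Imputes missing values in the data. Available strategies are the mean (numeric), median (numeric), KNN (k-nearest neighbors, numeric), and mode (categorical).
mice::mice()
: Imputes missing values across multiple columns of a data set. The available imputation methods include: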
pmm
: Predictive mean matching (numeric)
norm
: Bayesian linear regression (numeric)
norm.nob
: Linear regression by ignoring model error (numeric)
norm.boot
: Linear regression using bootstrapping (numeric)
norm.predict
: Linear regression, predicted values (numeric)
mean
: Unconditional mean imputation (numeric)
2l.norm
: Two-level normal imputation (numeric)
2l.pan
: Two-level normal imputation using pan (numeric)
2lonly.mean
: Imputation at level-2 of the class mean (numeric)
2lonly.norm
: Imputation at level-2 by Bayesian linear regression (numeric)
2lonly.pmm
: Imputation at level-2 by Predictive mean matching (any)
quadratic
: Imputation of quadratic terms (numeric)
logreg
: Logistic regression (binary)
logreg.boot
: Logistic regression with bootstrapping (binary)
polyreg
: Polytomous regression (unordered categorical data, >2 categories)
polr
: Proportional odds model (ordered categorical data, >2 categories)
lda
: Linear discriminant analysis (factor, >=2 categories)
cart
: Classification and regression trees (any)
rf
: Random forest imputations (any)
ri
: Random indicator method for nonignorable data (numeric)
sample
: Random sample from the observed values (any)
fastpmm
: Fast predictive mean matching using C++ (any)
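To make the simpler strategies above concrete, here is a minimal Python sketch of threshold-based cleaning plus mean, median, mode, and random-sample imputation. It illustrates the ideas only and does not reproduce the `dprep` or `mice` APIs. The names `drop_sparse` and `impute_column`, the use of `None` for missing cells, and the 0.5 thresholds are all assumptions made for this example. KNN and the model-based `mice` methods are not shown.

```python
import random
import statistics

MISSING = None  # assumption: missing cells are represented as None


def missing_fraction(values):
    """Share of missing entries; an empty sequence counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is MISSING for v in values) / len(values)


def drop_sparse(rows, max_col_missing=0.5, max_row_missing=0.5):
    """Drop columns, then rows, whose missing fraction exceeds the thresholds."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols)
            if missing_fraction([r[j] for r in rows]) <= max_col_missing]
    reduced = [[r[j] for j in keep] for r in rows]
    return [r for r in reduced if missing_fraction(r) <= max_row_missing]


def impute_column(values, strategy, rng=None):
    """Fill missing entries of one column from its observed entries only."""
    observed = [v for v in values if v is not MISSING]
    if strategy == "sample":
        # Random draw from the observed values, one draw per missing cell.
        rng = rng or random.Random(0)
        return [rng.choice(observed) if v is MISSING else v for v in values]
    summary = {"mean": statistics.mean,
               "median": statistics.median,
               "mode": statistics.mode}[strategy]
    fill = summary(observed)
    return [fill if v is MISSING else v for v in values]


rows = [
    [1.0, "a", None],
    [None, "b", None],
    [3.0, "a", None],
    [8.0, None, 4.0],
    [None, None, None],
]

cleaned = drop_sparse(rows)            # third column and last row are removed
numeric = [r[0] for r in cleaned]
labels = [r[1] for r in cleaned]

print(cleaned)
print(impute_column(numeric, "mean"))
print(impute_column(numeric, "median"))
print(impute_column(labels, "mode"))
```

Single-value strategies such as the mean or mode shrink the variance of the imputed column, which is one reason multiple-imputation methods like those in the list above are often preferred.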