Data preprocessing - ricket-sjtu/bi028 GitHub Wiki

Data types

Quantitative (numeric)
Qualitative (discrete, categorical)

Nominal (unordered)
Ordinal (ordered)
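
The nominal/ordinal distinction matters when categories are encoded as numbers. A minimal sketch in Python (not part of the original page; the helper names and example categories are hypothetical): nominal values become indicator columns because they carry no order, while ordinal values map to integer ranks that preserve their order.

```python
def one_hot(values):
    """Encode nominal values as indicator vectors (no order implied)."""
    levels = sorted(set(values))  # fixed but arbitrary column order
    return [[1 if v == level else 0 for level in levels] for v in values]


def ordinal_rank(values, ordered_levels):
    """Encode ordinal values as integer ranks following the given level order."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]


# Hypothetical example data
blood_type = ["A", "O", "B", "A"]           # nominal
severity = ["mild", "severe", "moderate"]   # ordinal

blood_encoded = one_hot(blood_type)
severity_encoded = ordinal_rank(severity, ["mild", "moderate", "severe"])
```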

Censored data: time-to-event data, loss to follow-up, right-censored observations
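
One common way to store right-censored time-to-event data is a (time, event indicator) pair per subject. The sketch below is an illustration only, assuming hypothetical names (`Observation`, `from_followup`) and made-up follow-up times: a subject whose event was not seen before the last contact is recorded as censored at that contact time.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    time: float   # time to the event, or to censoring
    event: bool   # True if the event was observed, False if right-censored


def from_followup(event_time, last_contact_time):
    """Build an observation; event_time is None if no event was ever recorded."""
    if event_time is not None and event_time <= last_contact_time:
        return Observation(time=event_time, event=True)
    # No event seen by the last contact: right-censored at that time
    return Observation(time=last_contact_time, event=False)


# Hypothetical subjects: (event time, last contact time)
subjects = [(5.0, 12.0), (None, 8.0), (20.0, 10.0)]
observations = [from_followup(e, c) for e, c in subjects]
n_censored = sum(not o.event for o in observations)
```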

Data transformation

Log transformation
Square transformation
Square-root transformation
Arctan transformation
Others: e.g., the ilr, clr, and alr log-ratio transformations for compositional data
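
The transformations listed above can be sketched in a few lines of standard-library Python. This is an illustration, not code from the original page: the function names and the example composition are assumptions, and ilr is left out because it additionally requires choosing an orthonormal basis.

```python
import math


def log_transform(xs):
    """Natural log; requires strictly positive values."""
    return [math.log(x) for x in xs]


def square_transform(xs):
    return [x * x for x in xs]


def sqrt_transform(xs):
    """Square root; requires non-negative values."""
    return [math.sqrt(x) for x in xs]


def arctan_transform(xs):
    """Maps any real value into (-pi/2, pi/2)."""
    return [math.atan(x) for x in xs]


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def alr(parts):
    """Additive log-ratio, using the last part as the reference."""
    ref = parts[-1]
    return [math.log(p / ref) for p in parts[:-1]]


composition = [0.2, 0.3, 0.5]  # hypothetical composition; parts sum to 1
clr_values = clr(composition)
alr_values = alr(composition)
```

The clr values always sum to zero, and alr returns one fewer value than the number of parts.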

Denoising

Missing-value imputation

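dprep::clean(): data cleaning; removes rows or columns whose proportion of missing values exceeds a given threshold;
dprep::ce.impute(): imputes missing values; the available strategies include the mean (numeric), the median (numeric), KNN (k-nearest neighbors, numeric), and the mode (categorical);
mice::mice(): imputes missing values across multiple columns of a dataset. The imputation methods include: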
- pmm: predictive mean matching (numeric data)
- norm: Bayesian linear regression (numeric)
- norm.nob: Linear regression ignoring model error (numeric)
- norm.boot: Linear regression using bootstrapping (numeric)
- norm.predict: Linear regression, predicted values (numeric)
- mean: Unconditional mean imputation (numeric)
- 2l.norm: Two-level normal imputation (numeric)
- 2l.pan: Two-level normal imputation using pan (numeric)
- 2lonly.mean: Imputation at level-2 of the class mean (numeric)
- 2lonly.norm: Imputation at level-2 by Bayesian linear regression (numeric)
- 2lonly.pmm: Imputation at level-2 by predictive mean matching (any)
- quadratic: Imputation of quadratic terms (numeric)
- logreg: Logistic regression (binary)
- logreg.boot: Logistic regression with bootstrapping (binary)
- polyreg: Polytomous regression (unordered categorical, >2 levels)
- polr: Proportional odds model (ordered categorical, >2 levels)
- lda: Linear discriminant analysis (factor, >=2 categories)
- cart: Classification and regression trees (any)
- rf: Random forest imputations (any)
- ri: Random indicator method for nonignorable data (numeric)
- sample: Random sample from the observed values (any)
- fastpmm: Fast predictive mean matching using C++ (any)
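
The simplest strategies above (dropping sparse rows, then filling with the mean, median, or mode) can be illustrated in plain Python. This sketch does not reproduce the behavior or signatures of dprep or mice; the function names, the `None`-as-missing convention, and the example data are all assumptions made for illustration.

```python
import statistics


def drop_sparse_rows(rows, max_missing_fraction):
    """Keep rows whose fraction of missing (None) entries is at most the threshold."""
    kept = []
    for row in rows:
        missing = sum(v is None for v in row)
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


def impute_column(column, strategy):
    """Fill None entries using the mean, median, or mode of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


# Hypothetical data: None marks a missing value
rows = [[1.0, None, None], [2.0, 3.0, None], [4.0, 5.0, 6.0]]
cleaned = drop_sparse_rows(rows, max_missing_fraction=0.5)

height = impute_column([1.0, None, 3.0, 8.0], "mean")
score = impute_column([1.0, None, 3.0, 8.0], "median")
color = impute_column(["red", "blue", None, "red"], "mode")
```

Filling every gap in a column with one value shrinks its variance, which is one reason model-based methods such as those listed for mice::mice() are often preferred for analysis.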