Data preprocessing - ricket-sjtu/bi028 GitHub Wiki

Data types

  1. Quantitative (numeric)
  2. Qualitative (discrete/categorical)
    • Nominal (unordered)
    • Ordinal (ordered)
  3. Censored data: time-to-event, loss to follow-up, right-censored
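The distinction between these types can be sketched in code. The following is a minimal Python illustration, not taken from the wiki: the `Observation` class, the `SEVERITY_LEVELS` ordering, and all data values are hypothetical. It assumes the common convention of storing a right-censored observation as a follow-up time plus an event indicator.

```python
from dataclasses import dataclass

# Ordinal data: categories with a meaningful order (hypothetical levels).
SEVERITY_LEVELS = ["mild", "moderate", "severe"]


def ordinal_code(level):
    """Map an ordinal category to its rank; the rank order is meaningful."""
    return SEVERITY_LEVELS.index(level)


@dataclass
class Observation:
    """One time-to-event record (hypothetical structure)."""
    time: float   # follow-up time
    event: bool   # True: event observed; False: right-censored (e.g. lost to follow-up)


def split_censored(observations):
    """Separate observed event times from right-censored follow-up times."""
    observed = [o.time for o in observations if o.event]
    censored = [o.time for o in observations if not o.event]
    return observed, censored


# Hypothetical sample: the second subject was lost to follow-up at time 8.
sample = [Observation(5.0, True), Observation(8.0, False), Observation(12.0, True)]
observed_times, censored_times = split_censored(sample)
```

For a censored record, the stored time is only a lower bound on the true event time, which is why such data should not be treated as ordinary quantitative values.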

Data transformation

  • Log transformation
  • Square transformation
  • Square root transformation
  • Arctan transformation
  • Others: for example, the ilr, clr, and alr transformations for compositional data
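The transformations listed above can be sketched with the Python standard library. This is an illustrative sketch only: the function names are hypothetical, the clr and alr functions follow the usual textbook definitions (log of each part relative to the geometric mean, and log-ratio against a reference part, here assumed to be the last one), and ilr is omitted because it additionally requires choosing an orthonormal basis.

```python
import math


def log_transform(xs):
    """Natural log of each value; requires strictly positive inputs."""
    return [math.log(x) for x in xs]


def square_transform(xs):
    """Square of each value."""
    return [x * x for x in xs]


def sqrt_transform(xs):
    """Square root of each value; requires non-negative inputs."""
    return [math.sqrt(x) for x in xs]


def arctan_transform(xs):
    """Arctangent of each value; maps the real line into (-pi/2, pi/2)."""
    return [math.atan(x) for x in xs]


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def alr(parts):
    """Additive log-ratio, using the last part as the reference (an assumption)."""
    ref = math.log(parts[-1])
    return [math.log(p) - ref for p in parts[:-1]]


# Hypothetical three-part composition (parts sum to 1).
composition = [0.2, 0.3, 0.5]
clr_values = clr(composition)
alr_values = alr(composition)
```

The clr values of a composition always sum to zero, while alr drops one dimension by expressing every part relative to the chosen reference.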

Denoising

Missing-value imputation

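  • dprep::clean(): data cleaning; removes rows or columns whose proportion of missing values exceeds a given threshold;
  • dprep::ce.impute(): imputes missing values in the data; the available strategies include the mean (numeric), median (numeric), KNN (k-nearest neighbors, numeric), and mode (categorical);
  • mice::mice(): imputes missing values across multiple columns of a data set. The imputation methods include: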
    • pmm: predictive mean matching (numeric data)
    • norm: Bayesian linear regression (numeric)
    • norm.nob: Linear regression by ignoring model error (numeric)
    • norm.boot: Linear regression using bootstrapping (numeric)
    • norm.predict: Linear regression, predicted values (numeric)
    • mean: Unconditional mean imputation (numeric)
    • 2l.norm: Two-level normal imputation (numeric)
    • 2l.pan: Two-level normal imputation using pan (numeric)
    • 2lonly.mean: Imputation at level-2 of the class mean (numeric)
    • 2lonly.norm: imputation at level-2 by Bayesian linear regression (numeric)
    • 2lonly.pmm: Imputation at level-2 by Predictive mean matching (any)
    • quadratic: Imputation of quadratic terms (numeric)
    • logreg: logistic regression (binary data)
    • logreg.boot: Logistic regression with bootstrapping (binary)
    • polyreg: polytomous regression (>2 unordered categorical data)
    • polr: proportional odds model (>2 ordered categorical data)
    • lda: linear discriminant analysis (factor, >=2 categories)
    • cart: Classification and regression trees (any)
    • rf: Random forest imputations (any)
    • ri: Random indicator method for nonignorable data (numeric)
    • sample: Random sample from the observed values (any)
    • fastpmm: Fast predictive mean matching using C++ (any)
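The two simplest ideas above, dropping rows with too many missing values and filling the remaining gaps with a single summary value, can be sketched in plain Python. This is not the dprep or mice API: the function names `drop_sparse_rows` and `impute_column` are hypothetical, missing values are assumed to be represented by `None`, and the model-based methods (pmm, norm, cart, and so on) are not covered.

```python
import statistics


def drop_sparse_rows(rows, max_missing_fraction):
    """Keep only rows whose fraction of missing (None) values is within the threshold."""
    kept = []
    for row in rows:
        missing_fraction = sum(value is None for value in row) / len(row)
        if missing_fraction <= max_missing_fraction:
            kept.append(row)
    return kept


def impute_column(values, strategy):
    """Fill missing (None) entries of one column with a single summary value.

    strategy: "mean" or "median" for numeric data, "mode" for categorical data.
    """
    observed = [value for value in values if value is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if value is None else value for value in values]


# Hypothetical data: None marks a missing value.
table = [[1.0, None, None], [2.0, 5.0, None]]
cleaned = drop_sparse_rows(table, max_missing_fraction=0.5)

numeric_filled = impute_column([1.0, None, 3.0], "mean")
categorical_filled = impute_column(["a", None, "b", "a"], "mode")
```

Single-value imputation of this kind shrinks the variance of the filled column, which is one reason multiple-imputation approaches such as those offered by mice are often preferred for inference.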