Data transformation - AgileDataScienceUB/ADS4 GitHub Wiki

Data transformation consists of defining a transformation T on an attribute X, Y = T(X), such that Y preserves the relevant information of X, eliminates at least one problem of X, and is more useful than X. It can be very useful for incomplete data, unevenly distributed data and so on. In our case it is needed both for data integration and for the classification task: when attributes are on different scales, we can use normalization to fix the problem. Common normalization techniques for fixing scale issues and obtaining a more "regular" distribution of the data, as our problem requires, are:

  • min-max normalization,
  • Z-score normalization.
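
The two normalization techniques named above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code; the function names and the `ages` sample are hypothetical, and mapping a constant attribute to 0 is an assumption made here to avoid dividing by zero.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute is mapped to 0 everywhere.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Center values on mean 0 and scale them to unit population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant attribute is mapped to 0 everywhere.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values, used only for illustration.
ages = [20, 30, 40, 50, 60]
ages_min_max = min_max_normalize(ages)  # values now lie in [0, 1]
ages_z = z_score_normalize(ages)        # values now have mean 0 and std 1
```

Min-max normalization bounds every attribute to the same interval but is sensitive to outliers, while Z-score normalization leaves the range unbounded and expresses each value in standard deviations from the mean.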

Another issue we have to face is that the models we are going to use work with numerical attributes only, so categorical and ordinal attributes must be transformed into numerical ones. Several strategies are feasible for this purpose:

  • using a discretization approach
  • using a mapping function on the categorical attributes
  • using a hash function on the categorical attributes
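
As a rough sketch of two of these strategies, the snippet below shows a mapping function for an ordinal attribute and a hash function for a categorical one. The names `map_encode` and `hash_encode`, the `level_order` mapping and the bucket count are all hypothetical choices made for illustration. `hashlib` is used instead of the built-in `hash()` because string hashes from `hash()` change between Python processes by default.

```python
import hashlib


def map_encode(values, mapping):
    """Replace each category with the number assigned to it in `mapping`."""
    return [mapping[v] for v in values]


def hash_encode(value, n_buckets=8):
    """Map a category to a bucket index in [0, n_buckets) with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Hypothetical ordinal attribute: the mapping preserves the order of the levels.
level_order = {"low": 0, "medium": 1, "high": 2}
levels_encoded = map_encode(["low", "high", "medium"], level_order)

# Hypothetical categorical attribute: equal categories always get the same bucket,
# but different categories may collide when n_buckets is small.
job_buckets = [hash_encode(j) for j in ["analyst", "engineer", "analyst"]]
```

A mapping needs every category to be known in advance, whereas hashing handles unseen categories at the cost of possible collisions.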

Categorical data treatment

Instead of using the well-known one-hot encoding approach, our model uses the following strategy. If a categorical feature F has only two unique values, we replace it with a boolean feature. If it takes categories x_1, ..., x_m with m > 2, we substitute each category x_i with the mean of the target value conditioned on F = x_i.

For example, imagine that we have a feature Job that takes the categories analyst, domain and engineer. In each row where Job = analyst, we would replace the category with the mean of the target value over all rows where Job = analyst.
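
The strategy described in this section can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the project's implementation: the function name `encode_categorical`, the sample `job` and `target` values, and the choice of which of two categories becomes `True` (the second in sorted order) are all hypothetical.

```python
from collections import defaultdict


def encode_categorical(values, target):
    """Encode a categorical feature as booleans (2 categories) or target means (otherwise)."""
    categories = sorted(set(values))
    if len(categories) == 2:
        # Assumption: the second category in sorted order is encoded as True.
        return [v == categories[1] for v in values]

    # Accumulate the sum and count of the target for each category.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1

    # Mean of the target conditioned on each category.
    means = {c: sums[c] / counts[c] for c in categories}
    return [means[v] for v in values]


# Hypothetical data: a Job feature and a binary target.
job = ["analyst", "domain", "engineer", "analyst", "engineer", "analyst"]
target = [1, 0, 1, 0, 1, 1]
job_encoded = encode_categorical(job, target)
# analyst rows receive 2/3, the domain row 0.0, engineer rows 1.0
```

Because the encoding is derived from the target, the category means would normally be computed on the training data only and then reused on validation and test rows, so that no target information leaks into the evaluation.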