Data cleaning - AgileDataScienceUB/ADS4 GitHub Wiki

The task of data preparation is to use the available information to reduce the size of the dataset, discarding the attributes and records that are useless for analysis. It includes crucial aspects such as outlier detection and missing value handling. Starting from the latter, we could use techniques that consist of:

Simply removing the records that contain missing values. This is the easiest way, but it discards useful information, so it should be reserved for records that have missing values in almost every feature.
Using statistical measures (e.g. median, mode, mean) to substitute missing values.
Using data segmentation, and computing the statistical measures used for substitution from the objects of each segment.
Using a supervised approach, transforming the problem into a classification or regression one that predicts the missing value (e.g. training a naive Bayes classifier and using the probability of each possible value of the missing attribute).
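The second and third techniques can be sketched in a few lines. This is only an illustration using the Python standard library: the function names `impute` and `impute_by_segment`, the toy `ages` column and the `groups` labels are all assumptions made for the example, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean, median, mode

# Hypothetical helpers for illustration; missing values are encoded as None.
STRATEGIES = {"mean": mean, "median": median, "mode": mode}

def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    # Raises statistics.StatisticsError if nothing is observed.
    stat = STRATEGIES[strategy](observed)
    return [stat if v is None else v for v in values]

def impute_by_segment(values, segments, strategy="mean"):
    """Impute each None using only the observed values of its own segment."""
    filled = list(values)
    for seg in set(segments):
        idx = [i for i, s in enumerate(segments) if s == seg]
        seg_filled = impute([values[i] for i in idx], strategy)
        for i, v in zip(idx, seg_filled):
            filled[i] = v
    return filled

# Toy data: one numeric feature and a segment label per record.
ages = [20, None, 30, 40, None, 60]
groups = ["a", "a", "a", "b", "b", "b"]

global_fill = impute(ages, "median")                     # one statistic for the whole column
segment_fill = impute_by_segment(ages, groups, "mean")   # one statistic per segment
```

With the global median both gaps receive the same value, while the segment-based version fills each gap from its own group, which is the point of segmenting first.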

The approach we will take is the following:

If the missing value is in the training set, we will replace a missing value in row i and feature j with the mean of feature j conditioned on the target value of row i. That is, if the target value of the row with the missing value is 1, we will fill the missing value with the mean of that feature over all rows whose target value equals 1.

If the missing value is in the test set, we will simply fill the missing values of feature j with the unconditional mean of feature j.
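A minimal sketch of these two rules follows. The helper names (`column_means`, `fill_train`, `fill_test`) and the toy data are hypothetical, and missing values are assumed to be encoded as `None`. The text does not say which rows the test-set mean is computed over, so the sketch takes a `reference_rows` argument; passing the training rows, as done here, is an assumption.

```python
from collections import defaultdict
from statistics import mean

def column_means(rows):
    """Mean of each feature over the rows where that feature is observed."""
    n_features = len(rows[0])
    return [mean(r[j] for r in rows if r[j] is not None) for j in range(n_features)]

def fill_train(rows, targets):
    """Fill feature j of row i with the mean of j over rows sharing row i's target."""
    by_target = defaultdict(list)
    for row, t in zip(rows, targets):
        by_target[t].append(row)
    means = {t: column_means(group) for t, group in by_target.items()}
    return [[means[t][j] if v is None else v for j, v in enumerate(row)]
            for row, t in zip(rows, targets)]

def fill_test(rows, reference_rows):
    """Fill feature j with its unconditional mean over reference_rows."""
    means = column_means(reference_rows)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Toy data: two features, binary target.
X_train = [[1.0, 10.0], [3.0, None], [None, 30.0], [8.0, 50.0]]
y_train = [1, 1, 0, 0]
X_test = [[None, 20.0]]

train_filled = fill_train(X_train, y_train)
# Assumption: the test-set mean is taken over the training rows.
test_filled = fill_test(X_test, X_train)
```

In the toy data the gap in the second row is filled from the rows with target 1 only, while the gap in the test row is filled from all reference rows regardless of target.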

Outliers

As for outlier detection, even the definition of an outlier is not straightforward. One common definition is: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins, 1980]. In this sense, we could use statistics to estimate the mechanism that generates our values, and identify as outliers the values that deviate from it. It is very common to use simple scatter plots to visualize and identify outliers; however, this approach is often too simplistic. Many techniques can help us in this task:

Statistical models (depth-based or deviation-based approaches).
Spatial approaches (distance-based or density-based ones).
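To make the two families more concrete, here is a rough sketch of one simple rule from each. The z-score rule stands in for the statistical family and the k-nearest-neighbour distance rule for the spatial, distance-based family. They are illustrative choices, not necessarily the methods used in this project. The function names, thresholds and toy data are all assumptions.

```python
from math import dist
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Indices of values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    # Assumes the values are not all identical (sigma > 0).
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

def knn_distance_outliers(points, k=2, radius=3.0):
    """Indices of points whose k-th nearest neighbour is farther away than `radius`."""
    flagged = []
    for i, p in enumerate(points):
        # Euclidean distances to every other point, nearest first.
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        if distances[k - 1] > radius:
            flagged.append(i)
    return flagged

# Toy data: one extreme value in a 1-D sample, one isolated point in 2-D.
values = [10, 11, 9, 10, 12, 10, 11, 50]
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

stat_outliers = zscore_outliers(values)
dist_outliers = knn_distance_outliers(points)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so in small samples the z-score rule needs a fairly low threshold; the distance-based rule avoids that issue but requires choosing `k` and `radius`.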