12 Outlier Detection

12.1. What is an outlier?

It is common for datasets to contain some rows that do not make sense. This happens when one or more values in a row fall outside the expected range. We call these records outliers.

Outliers can appear for several reasons, from poor calibration of measurement equipment to human error when recording data manually. As a result, whenever a dataset has a considerable size, it is safe to assume that it contains some outliers.

Analyzing a dataset that contains outliers is problematic, because these errors make the results inaccurate. Outliers are often so different from the rest of the data that they distort the mean and standard deviation, making the analysis unreliable. For this reason, it is best to remove outliers from the dataset before starting any analysis.

12.2. Outlier Detection

There are several ways to identify whether a row is an outlier. The process basically consists of comparing each row with the rest and checking whether any value lies too far from the distribution.

The simplest way is to define a maximum and/or minimum threshold and treat every value above the maximum or below the minimum as an outlier. The problem with this approach is that the data must always be studied first in order to choose a suitable threshold.
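The fixed-threshold approach can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `threshold_outliers`, the sample salaries, and the limits 500 and 20000 are all assumptions made for the example.

```python
def threshold_outliers(values, lower=None, upper=None):
    """Return the values that fall below `lower` or above `upper`.

    Either bound may be None, in which case that side is not checked.
    """
    outliers = []
    for v in values:
        if lower is not None and v < lower:
            outliers.append(v)
        elif upper is not None and v > upper:
            outliers.append(v)
    return outliers


# Illustrative salaries and hand-picked limits (both are assumptions).
salaries = [1500.0, 2100.0, 1800.0, 95000.0, 30.0]
flagged = threshold_outliers(salaries, lower=500.0, upper=20000.0)
print(flagged)
```

The limits have to be chosen by someone who knows the data, which is exactly the drawback described above.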

Another common way is to define a range of valid values based on the mean and standard deviation of the distribution. The process then adapts to the context of the data: instead of a hand-picked cut-off value, it only needs a tolerance range derived from the data distribution itself. The problem with this approach is that the outliers themselves affect the mean and standard deviation, which can lead to incorrect identification in some cases.
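A minimal sketch of the mean and standard deviation approach is shown below. The function name `std_outliers`, the tolerance factor `k`, and the sample values are assumptions for illustration; the source does not say which tolerance the project uses.

```python
from statistics import mean, pstdev


def std_outliers(values, k=2.0):
    """Return values farther than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    low, high = mu - k * sigma, mu + k * sigma
    return [v for v in values if v < low or v > high]


# Illustrative salaries: six ordinary values and one extreme value.
salaries = [1500.0, 1600.0, 1700.0, 1800.0, 1900.0, 2000.0, 50000.0]
print(std_outliers(salaries, k=2.0))
print(std_outliers(salaries, k=3.0))
```

With `k=2.0` the extreme value is flagged, but with `k=3.0` nothing is, because the extreme value inflates the standard deviation it is measured against. This is the weakness mentioned above.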

12.3. Strategies Used in the Project

Both detection methods (fixed limits and standard deviation) were implemented. However, more elaborate processing proved necessary to make the automatic detection of outliers more reliable.

12.3.1. Grouping

Each profession has its own salary distribution, so running a single detection process over all records is not a good idea. The data is first grouped by profession, and detection is then carried out separately for each group, which accounts for the differences between the salary distributions.
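The grouping step might look like the sketch below. The record layout (dictionaries with `cbo` and `salary` keys), the group labels `A` and `B`, the factor `k`, and the function name are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def outliers_by_group(records, key="cbo", value="salary", k=2.0):
    """Run mean/standard-deviation detection separately for each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    flagged = []
    for recs in groups.values():
        vals = [r[value] for r in recs]
        mu, sigma = mean(vals), pstdev(vals)
        for r in recs:
            if abs(r[value] - mu) > k * sigma:
                flagged.append(r)
    return flagged


# Two hypothetical professions: 9000 is extreme for "A" but ordinary for "B".
low_paid = [1500.0, 1600.0, 1700.0, 1800.0, 1900.0, 2000.0, 9000.0]
high_paid = [9000.0, 9500.0, 10000.0, 10500.0, 11000.0, 9800.0, 10200.0]
records = [{"cbo": "A", "salary": s} for s in low_paid]
records += [{"cbo": "B", "salary": s} for s in high_paid]

flagged = outliers_by_group(records)
print(flagged)
```

Only the 9000 salary in group `A` is flagged. The same amount in group `B` sits inside that group's own range.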

12.3.2. Salary Progression

When the data covers long time periods (datasets spanning at least one year), we must take into account the progression of the minimum wage.

With each adjustment of the minimum wage, the whole wage distribution is expected to follow the new value. A record that was previously considered an outlier may therefore no longer be one, or vice versa.

To introduce this variable into the process, a second grouping is performed by date, in ascending order. Whenever the month being processed includes a minimum wage adjustment, the standard deviation range in use is multiplied by the ratio of the new minimum wage to the previous one.

For a month i (other than the first), the standard deviation value used is defined from the previous month's value and the minimum wage progression:

$$std_i = std_{i-1} \cdot \frac{MinWage_i}{MinWage_{i-1}}$$
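The recurrence can be sketched in a few lines. The function name `progressed_std` and the numbers are made up for illustration; they are not real minimum wage values or project parameters.

```python
def progressed_std(initial_std, min_wages):
    """Scale the standard deviation month by month.

    `min_wages` holds the minimum wage of each month in ascending date
    order. The first month keeps `initial_std`; every later month uses
    std_i = std_{i-1} * MinWage_i / MinWage_{i-1}.
    """
    stds = [initial_std]
    for prev, curr in zip(min_wages, min_wages[1:]):
        stds.append(stds[-1] * curr / prev)
    return stds


# Four months with one adjustment in the third month (illustrative values).
min_wages = [1000.0, 1000.0, 1100.0, 1100.0]
stds = progressed_std(500.0, min_wages)
print(stds)
```

Months without an adjustment have a ratio of 1, so the value only changes in the month where the minimum wage changes.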

12.3.3. Cross-Referencing Data With RAIS

The RAIS database contains more reliable records. It can be used to find the highest valid salary for each CBO and compare it with the values in CAGED: any CAGED record with a salary higher than the RAIS maximum for its CBO is considered an outlier.
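A minimal sketch of this cross-check follows. The record layout (`cbo` and `salary` keys), the function names, and the sample data are assumptions. Skipping CAGED records whose CBO has no RAIS counterpart is also an assumption of this sketch, since the source does not say how that case is handled.

```python
def max_salary_by_cbo(rais_records):
    """Return the highest salary seen in RAIS for each CBO."""
    maxima = {}
    for r in rais_records:
        cbo, salary = r["cbo"], r["salary"]
        maxima[cbo] = max(maxima.get(cbo, salary), salary)
    return maxima


def above_rais_max(caged_records, rais_max):
    """Return CAGED records whose salary exceeds the RAIS maximum for their CBO."""
    return [
        r for r in caged_records
        if r["cbo"] in rais_max and r["salary"] > rais_max[r["cbo"]]
    ]


# Hypothetical records; "A" and "B" stand in for real CBO codes.
rais = [
    {"cbo": "A", "salary": 2000.0},
    {"cbo": "A", "salary": 3500.0},
    {"cbo": "B", "salary": 12000.0},
]
caged = [
    {"cbo": "A", "salary": 3000.0},
    {"cbo": "A", "salary": 40000.0},
    {"cbo": "B", "salary": 11000.0},
]

rais_max = max_salary_by_cbo(rais)
flagged = above_rais_max(caged, rais_max)
print(flagged)
```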

12.4. Microsalary Identification

There are records in the CAGED database of very small salaries that should also be considered outliers.

The strategies presented above only target outliers at the upper end of the distribution, so they are not sufficient to identify these records.

The proposed method for identifying these outliers combines the average salary per CBO in the RAIS database with the minimum wage in force on the date of the record. Any record with a salary below the resulting limit is considered an outlier.

The limit is the lower of the minimum wage and a portion (between 10% and 25%) of the average wage in the RAIS database.

$$limit = \min(MinWage, AVGWage \cdot proportion)$$
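As a worked example of the limit, the sketch below applies the formula to made-up numbers. The function name `microsalary_limit`, the default proportion, and the wage values are assumptions for illustration.

```python
def microsalary_limit(min_wage, avg_wage, proportion=0.25):
    """Return min(MinWage, AVGWage * proportion).

    `proportion` is expected to lie between 0.10 and 0.25, the range
    given in the text.
    """
    if not 0.10 <= proportion <= 0.25:
        raise ValueError("proportion must be between 0.10 and 0.25")
    return min(min_wage, avg_wage * proportion)


# Illustrative values, not real minimum wage or RAIS averages.
limit_low_avg = microsalary_limit(min_wage=1000.0, avg_wage=3000.0)
limit_high_avg = microsalary_limit(min_wage=1000.0, avg_wage=8000.0)
print(limit_low_avg, limit_high_avg)
```

For a CBO with a low average wage, the portion of the average sets the limit. For a CBO with a high average wage, the minimum wage caps it.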

12.5. Results

The grouping strategy is essential for identifying outliers, as each profession has its own distribution.

Cross-referencing data with RAIS proved to be a more efficient strategy than salary progression, providing faster processing and more reliable results.