Outlier analysis and anomaly detection

(Image credit: Veronica Benavides, Unsplash.com)

When modeling, it is important to clean the data sample to ensure that the observations best represent the problem.

Occasionally, a data set may contain observations that fall outside the range of what is expected and do not resemble the rest of the data. These are called outliers, and modeling and machine learning performance in general can often be improved by understanding and even removing them.

Main questions to answer:

  • What is an outlier? An outlier is an unlikely observation in a data set and can have one of many causes.
  • How can simple univariate statistics, such as the standard deviation and the interquartile range, be used to identify and remove outliers from a sample of data?
  • How can an outlier detection model be used to identify and remove rows from a training dataset to improve predictive model performance?

What is an outlier?

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are sometimes excluded from the data set.

The Standard Deviation Method

If we know that the distribution of values in the sample is Gaussian or Gaussian-like, then we can use the standard deviation of the sample as a cut-off for identifying outliers.

The Gaussian distribution has the property that the number of standard deviations from the mean can be used to reliably summarize the percentage of values in the sample.

We can cover more of the data sample if we expand the range as follows:

  • 1 standard deviation from the mean: 68% of the data
  • 2 standard deviations from the mean: 95% of the data
  • 3 standard deviations from the mean: 99.7% of the data

A value that falls outside of 3 standard deviations may still be part of the distribution, but it is an unlikely or rare event.

Three standard deviations from the mean is a common threshold in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For smaller data samples, a value of 2 standard deviations (95%) can be used, and for larger samples a value of 4 standard deviations (99.9%).

Standard score

Sometimes the data is first standardized (e.g., to a Z-score with zero mean and unit variance) so that outlier detection can be performed using standard Z-score cut-off values. This is a convenience and is not strictly required; here we will do the calculations on the original scale of the data to make things clearer.

We can calculate the mean and standard deviation of a given sample, and then use them to compute the cut-off for identifying outliers as values more than 3 standard deviations from the mean.

# identify outliers with standard deviation
import numpy as np

# seed the random number generator
np.random.seed(42)

# generate univariate observations
data = 5 * np.random.randn(1000) + 50

# calculate summary statistics
data_mean, data_std = np.mean(data), np.std(data)

print('data_mean = %.2f' % data_mean, ', data_std = %.2f' % data_std)

# define the cut-off: 3 standard deviations from the mean
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))

# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
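
Equivalently, we can apply the standardization mentioned above first and use a fixed Z-score cut-off. A minimal sketch, reusing data, data_mean, and data_std from the example above:

# identify outliers using Z-scores (reuses data, data_mean, data_std from above)
z_scores = (data - data_mean) / data_std

# an absolute Z-score above 3 is equivalent to the 3-standard-deviation cut-off
outliers_z = data[np.abs(z_scores) > 3]
print('Identified outliers (Z-score method): %d' % len(outliers_z))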


Interquartile Range Method

There will be cases when we do not know the type of probability distribution of a dataset, or the data does not follow a Gaussian distribution.

In these cases, we can use the interquartile range (IQR), which is used extensively in producing boxplots. It measures the spread of the data using quartiles and is defined as IQR = Q3 - Q1.

John W. Tukey wrote the book Exploratory Data Analysis in 1977, in which he argued that too much emphasis in statistics was placed on statistical hypothesis testing and that more emphasis was needed on using data to suggest hypotheses to test. Exploratory Data Analysis does not require any prior assumption about the statistical distribution of the underlying data.

Tukey suggested computing the five-number summary of numerical data: the two extremes (minimum and maximum), the median, and the two quartiles, since they are defined for every empirical distribution. There is also a seven-number summary, which extends the five-number summary.

Tukey also gives a criterion for identifying outliers. If Q1 and Q3 are the first and third quartiles, and the interquartile range is IQR = Q3 - Q1, then an outlier is a value that falls below Q1 - 1.5 IQR or above Q3 + 1.5 IQR (these limits are known as the inner fences).

We can additionally define Tukey's outer fences: the lower outer fence Q1 - 3.0 IQR and the upper outer fence Q3 + 3.0 IQR. Values beyond the outer fences can be considered probable outliers, while values between the inner and outer fences are possible outliers.

# Tukey's fences method

def tukeys_method(df, variable):
    # Takes a pandas DataFrame and the column name (string) of the variable of interest.
    # Returns the indices of probable outliers (beyond the outer fences), the indices of
    # possible outliers (beyond the inner fences), and the four fence values.
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    # Interquartile range
    iqr = q3-q1
    # Inner and outer fences
    inner_fence = 1.5*iqr
    outer_fence = 3*iqr
    
    #inner fence lower and upper end
    inner_fence_le = q1-inner_fence
    inner_fence_ue = q3+inner_fence
    # shorter notation
    ifl, ifu = inner_fence_le, inner_fence_ue
    
    #outer fence lower and upper end
    outer_fence_le = q1-outer_fence
    outer_fence_ue = q3+outer_fence
    # shorter notation
    ofl, ofu = outer_fence_le, outer_fence_ue
    
    # Initial empty lists
    outliers_prob = []
    outliers_poss = []

    for index, x in enumerate(df[variable]):
        if x <= outer_fence_le or x >= outer_fence_ue:
            outliers_prob.append(index)
    for index, x in enumerate(df[variable]):
        if x <= inner_fence_le or x >= inner_fence_ue:
            outliers_poss.append(index)
    return outliers_prob, outliers_poss, ifl, ifu, ofl, ofu
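
A minimal usage sketch (the DataFrame and column below are made up for illustration):

# example usage of tukeys_method on a synthetic DataFrame
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = np.append(rng.normal(50, 5, 1000), [120, -30])  # add two extreme values
df = pd.DataFrame({'value': values})

probable, possible, ifl, ifu, ofl, ofu = tukeys_method(df, 'value')
print('Probable outliers (beyond outer fences):', probable)
print('Possible outliers (beyond inner fences):', possible)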
        

Cook's distance

In statistics, Cook's distance or Cook's D is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate influential data points that are particularly worth checking for validity; or to indicate regions of the design space where it would be good to be able to obtain more data points. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
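
In practice, Cook's distance is often obtained from the influence diagnostics of a fitted ordinary least squares model. The sketch below uses statsmodels and synthetic data, with the common 4/n rule of thumb as a cut-off (one convention among several; the data and threshold are illustrative assumptions):

# Cook's distance via statsmodels influence diagnostics (synthetic data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 100)
y = 2 * x + rng.normal(0, 0.5, 100)
y[10] += 8                              # inject an influential point

X = sm.add_constant(x)                  # add an intercept term
model = sm.OLS(y, X).fit()
cooks_d, _ = model.get_influence().cooks_distance

# common rule of thumb: flag observations with Cook's D above 4/n
threshold = 4 / len(x)
influential = np.where(cooks_d > threshold)[0]
print('Influential observations:', influential)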

Isolation Forest

Isolation Forest is an algorithm for data anomaly detection. It detects anomalies using isolation (how far a data point is from the rest of the data). This method deviates from the mainstream philosophy that underpinned most existing anomaly detectors at the time. Instead of profiling all normal instances before anomalies are identified, Isolation Forest detects anomalies using binary trees. The algorithm has a linear time complexity with a low constant and a low memory requirement, which works well with high-volume data.
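
A minimal sketch using scikit-learn's IsolationForest implementation (the data and the contamination value are illustrative assumptions):

# Isolation Forest anomaly detection with scikit-learn (synthetic data)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(50, 5, size=(1000, 2))
X[:10] += 40                            # inject a few anomalous rows

# contamination is the expected fraction of anomalies in the data (an assumption)
iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(X)             # -1 = anomaly, 1 = normal

print('Anomalies flagged:', np.sum(labels == -1))
X_clean = X[labels == 1]                # rows kept for downstream modeling
print('Rows kept:', len(X_clean))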


Jupyter Notebook example

Please see: Outlier Detection


More resources


Created: 02-02-2023 (C. Lizárraga); Updated: 02-09-2023 (C. Lizárraga).

CC BY-NC-SA