Anomaly Detection

Anomaly detection aims to find abnormal patterns that deviate from the rest of the data, called anomalies or outliers. Handling outliers and anomalies is critical to the machine learning process, because outliers can drastically skew the results of analysis and statistical modeling. The usual tendency is to rely on straightforward visual methods such as box plots, histograms, and scatter plots, but dedicated outlier detection algorithms are far more valuable in fields that process large amounts of data and need automated pattern recognition across large datasets. PyOD, a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data, bridges this gap. We use the following algorithms within PyOD to detect and analyze outliers and flag their presence in datasets.
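Every PyOD detector follows the same `fit` / `labels_` / `decision_scores_` pattern, so the sketches under each algorithm below differ mainly in which detector they construct. A minimal sketch of that shared interface, using PyOD's synthetic-data helper (the contamination rate and data sizes are illustrative assumptions, not values from AutoBrewML):

```python
from pyod.models.knn import KNN  # any PyOD detector exposes this interface
from pyod.utils.data import generate_data

# Synthetic 2-D training data with 10% injected outliers (illustrative).
X_train, y_train = generate_data(n_train=200, n_features=2,
                                 contamination=0.1, train_only=True)

detector = KNN(contamination=0.1)  # expected outlier fraction (assumption)
detector.fit(X_train)

print(detector.labels_[:10])           # 0 = inlier, 1 = outlier
print(detector.decision_scores_[:10])  # raw outlier scores, higher = worse
```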


1. Angle-Based Outlier Detection (ABOD)
It considers the relationship between each point and its neighbors, but not the relationships among those neighbors. The variance of a point's weighted cosine scores to all of its neighbors can be viewed as its outlying score. ABOD performs well on multi-dimensional data.
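A minimal sketch with PyOD's `ABOD` detector; the toy data, contamination rate, and neighbor count are illustrative assumptions:

```python
import numpy as np
from pyod.models.abod import ABOD

rng = np.random.default_rng(42)
# 200 normal points plus a handful of far-away outliers (toy data).
X = np.vstack([rng.normal(0, 1, size=(200, 3)),
               rng.normal(6, 1, size=(5, 3))])

# method='fast' approximates ABOD using only the n_neighbors nearest points.
clf = ABOD(contamination=0.05, n_neighbors=10, method='fast')
clf.fit(X)
print(clf.labels_.sum(), "points flagged as outliers")
```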


2. k-Nearest Neighbors Detector
For any data point, the distance to its kth nearest neighbor can be viewed as its outlying score. PyOD supports three kNN detectors, selected via the detector's `method` parameter (see the sketch after this list):
Largest: uses the distance to the kth neighbor as the outlier score
Mean: uses the average of the distances to all k neighbors as the outlier score
Median: uses the median of the distances to the k neighbors as the outlier score
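A minimal sketch comparing the three variants (the toy data and parameter values are assumptions):

```python
import numpy as np
from pyod.models.knn import KNN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 1, size=(5, 2))])

# 'largest', 'mean', and 'median' correspond to the variants listed above.
for method in ('largest', 'mean', 'median'):
    clf = KNN(n_neighbors=5, method=method, contamination=0.05)
    clf.fit(X)
    print(method, "->", clf.labels_.sum(), "outliers flagged")
```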


3. Isolation Forest
It uses the scikit-learn library internally. The data is recursively partitioned using a set of random trees, and Isolation Forest derives an anomaly score from how isolated a point is in this tree structure: points that can be separated from the rest with fewer splits score as more anomalous. The anomaly score is then used to separate outliers from normal observations. Isolation Forest performs well on multi-dimensional data.
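A minimal sketch with PyOD's `IForest` wrapper; the tree count, contamination rate, and data are illustrative:

```python
import numpy as np
from pyod.models.iforest import IForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),
               rng.uniform(5, 10, size=(10, 4))])

# Wraps sklearn.ensemble.IsolationForest under the hood.
clf = IForest(n_estimators=100, contamination=0.02, random_state=1)
clf.fit(X)
print(clf.labels_.sum(), "outliers flagged")
print(clf.decision_scores_[:5])  # in PyOD, higher score = more anomalous
```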


4. Histogram-based Outlier Detection (HBOS)
It is an efficient unsupervised method that assumes feature independence and computes the outlier score by building a histogram for each feature. It is much faster than multivariate approaches, but at the cost of lower precision.
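A minimal sketch with `HBOS`; the bin count, contamination rate, and data are illustrative choices:

```python
import numpy as np
from pyod.models.hbos import HBOS

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(1000, 5)),
               rng.normal(7, 1, size=(20, 5))])

# One histogram per feature, reflecting the feature-independence assumption.
clf = HBOS(n_bins=10, contamination=0.02)
clf.fit(X)
print(clf.labels_.sum(), "outliers flagged")
```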


5. Local Correlation Integral (LOCI)
LOCI is very effective at detecting outliers and groups of outliers. It provides a LOCI plot for each point that summarizes much of the information about the data in the area around the point, determining clusters, micro-clusters, their diameters, and their inter-cluster distances. Other outlier-detection methods cannot match this feature, since they output only a single number for each point.
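A minimal sketch with `LOCI`; note that LOCI is computationally expensive, so the toy sample here is deliberately small (data and contamination rate are assumptions):

```python
import numpy as np
from pyod.models.loci import LOCI

rng = np.random.default_rng(3)
# Keep the sample small: LOCI scales poorly with the number of points.
X = np.vstack([rng.normal(0, 1, size=(80, 2)),
               rng.normal(5, 0.5, size=(4, 2))])

clf = LOCI(contamination=0.05)  # alpha and k left at their defaults
clf.fit(X)
print(clf.labels_.sum(), "outliers flagged")
```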


6. Feature Bagging
A feature bagging detector fits a number of base detectors, each on a random subset of the dataset's features, and combines their scores by averaging or another combination method to improve prediction accuracy. By default, Local Outlier Factor (LOF) is used as the base estimator, but any detector, such as kNN or ABOD, can be substituted. Feature bagging first constructs n sub-samples by randomly selecting subsets of the features, which induces diversity among the base estimators; the final prediction score is then the average or the maximum of all base detectors' scores.
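A minimal sketch using `FeatureBagging` with a kNN base detector in place of the default LOF (parameters are illustrative; this detector ships as `pyod.models.feature_bagging.FeatureBagging` in the PyOD releases this wiki targets, so check the version you have installed):

```python
import numpy as np
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.knn import KNN

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(300, 6)),
               rng.normal(6, 1, size=(6, 6))])

# Each of the 10 base kNN detectors sees a random subset of the features;
# combination='average' averages their scores ('max' is the alternative).
clf = FeatureBagging(base_estimator=KNN(), n_estimators=10,
                     combination='average', contamination=0.02,
                     random_state=4)
clf.fit(X)
print(clf.labels_.sum(), "outliers flagged")
```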


7. Clustering-Based Local Outlier Factor (CBLOF)
It classifies the data into small clusters and large clusters. The anomaly score is then calculated based on the size of the cluster the point belongs to, as well as the distance to the nearest large cluster.
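A minimal sketch with `CBLOF`; `alpha` and `beta` control how clusters are split into "small" and "large" (the values shown are the library defaults, the toy data is an assumption, and on real data these parameters may need tuning to obtain a valid cluster separation):

```python
import numpy as np
from pyod.models.cblof import CBLOF

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(400, 3)),
               rng.normal(7, 1, size=(8, 3))])

# Clusters covering alpha=90% of points count as 'large'; beta bounds the
# size ratio used to separate large clusters from small ones.
clf = CBLOF(n_clusters=8, alpha=0.9, beta=5,
            contamination=0.02, random_state=5)
clf.fit(X)
print(clf.labels_.sum(), "outliers flagged")
```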
Using each of the above algorithms, we estimate the number of outliers and inliers and assign each point in the dataset a Boolean value marking it as an inlier or an outlier. We then allow user intervention to make the final call on which outliers to remove from the dataset before the model is retrained. Anomalies are not always bad data; they can reveal trends that sometimes play a key role in predictions. It is therefore important to analyze the flagged anomalies rather than discard them blindly.
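A minimal sketch of how the per-detector Boolean flags might be gathered into one table for that user review (the detector mix and pandas layout are assumptions for illustration, not AutoBrewML's actual implementation):

```python
import numpy as np
import pandas as pd
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(300, 4)),
               rng.normal(6, 1, size=(6, 4))])

detectors = {'kNN': KNN(), 'IForest': IForest(random_state=6), 'HBOS': HBOS()}

# One Boolean column per detector: True marks a point flagged as an outlier.
flags = pd.DataFrame({name: clf.fit(X).labels_.astype(bool)
                      for name, clf in detectors.items()})
flags['votes'] = flags.sum(axis=1)

# Rows flagged by at least one detector, for the user to accept or reject.
print(flags[flags['votes'] > 0])
```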
