OutlierDetection - juedaiyuer/researchNote GitHub Wiki
#Outlier detection on a real data set#
Archived from WizNote.
Outlier detection, also known as anomaly detection.
##load_boston##
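Load and return the Boston house-prices dataset (regression).
It contains 506 samples with 13 features each. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the call below requires an older release.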
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> print(boston.data.shape)
(506, 13)
##Mahalanobis Distance##
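The Mahalanobis distance, introduced by the Indian statistician P. C. Mahalanobis, measures distance in terms of the covariance of the data. It is an effective way to measure the similarity between two unknown sample sets. Unlike the Euclidean distance, it takes the correlations between features into account (for example, information about a person's height also carries information about their weight, because the two are correlated), and it is scale-invariant, that is, independent of the scale of measurement.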
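For a point x and a distribution with mean μ and covariance Σ, the distance is sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)). The following is a minimal sketch of that formula with NumPy, not code from the original note. The helper name `mahalanobis` and the synthetic height/weight data are assumptions made for illustration. It also checks the scale-invariance property by converting height from centimetres to metres.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a distribution with the given mean and covariance."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ z = diff rather than forming the explicit inverse of cov.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))

rng = np.random.default_rng(0)
# Hypothetical correlated samples: height in cm and weight in kg.
height = rng.normal(170.0, 10.0, size=500)
weight = 0.9 * (height - 170.0) + 65.0 + rng.normal(0.0, 5.0, size=500)
X = np.column_stack([height, weight])

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
point = np.array([185.0, 60.0])  # tall but light: unusual given the correlation

d_original = mahalanobis(point, mean, cov)

# Rescale height from cm to m; the Mahalanobis distance should stay the same.
scale = np.array([0.01, 1.0])
Xs = X * scale
d_rescaled = mahalanobis(point * scale, Xs.mean(axis=0), np.cov(Xs, rowvar=False))

print(d_original, d_rescaled)
```

With an identity covariance matrix the same function reduces to the ordinary Euclidean distance, which is a quick way to sanity-check it.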
##Outlier detection methods##
###EllipticEnvelope###
An object for detecting outliers in a Gaussian distributed dataset.
Outlier detection is similar to novelty detection in the sense that the goal is to separate a core of regular observations from some polluting ones, called “outliers”. Yet, in the case of outlier detection, we don’t have a clean data set representing the population of regular observations that can be used to train any tool.
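scikit-learn provides an object called EllipticEnvelope, which is based on a robust estimate of the covariance. When the data really are Gaussian distributed, it is expected to perform better than One-Class SVM.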
class sklearn.covariance.EllipticEnvelope(store_precision=True, assume_centered=False, support_fraction=None, contamination=0.1, random_state=None)
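A minimal usage sketch of the class above, not taken from the original note. The synthetic data set (200 Gaussian inliers plus 20 uniformly scattered points) and the choice of contamination=0.1 are assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(42)
# Synthetic data (illustrative assumption): Gaussian inliers plus uniform noise points.
inliers = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=200)
outliers = rng.uniform(low=-8.0, high=8.0, size=(20, 2))
X = np.vstack([inliers, outliers])

detector = EllipticEnvelope(contamination=0.1, random_state=0)
detector.fit(X)

labels = detector.predict(X)          # +1 for inliers, -1 for outliers
distances = detector.mahalanobis(X)   # squared Mahalanobis distances to the robust location

print("flagged as outliers:", int((labels == -1).sum()))
```

The points with the largest robust Mahalanobis distances are the ones labelled -1. The share of flagged points follows the contamination value, not the true number of outliers.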
####Parameters####
The descriptions below refer to the MCD (Minimum Covariance Determinant) algorithm.
store_precision : bool
Specify if the estimated precision is stored.
assume_centered : bool
If True, the support of robust location and covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful for data whose mean is approximately, but not exactly, zero. If False, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.
support_fraction : float, 0 < support_fraction < 1
The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: [n_sample + n_features + 1] / 2.
contamination : float, 0. < contamination < 0.5
The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
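To make the last two parameters concrete, here is a small sketch under stated assumptions. It evaluates the default support size quoted above for the Boston dimensions (506 samples, 13 features), then shows how the number of flagged points grows with contamination on synthetic Gaussian data. The variable names and the synthetic data are illustrative, not part of the original note.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

n_samples, n_features = 506, 13
# Default support size from the formula quoted above, and the fraction of the data it covers.
default_support = (n_samples + n_features + 1) / 2
print(default_support, default_support / n_samples)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # synthetic Gaussian data, for illustration only

flagged = {}
for contamination in (0.05, 0.1, 0.2):
    model = EllipticEnvelope(contamination=contamination, random_state=0)
    labels = model.fit_predict(X)  # +1 for inliers, -1 for outliers
    flagged[contamination] = int((labels == -1).sum())

print(flagged)
```

Because contamination sets the decision threshold as a quantile of the fitted distances, a larger value flags more points, even on data that contain no real outliers.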
##source##