Ensembles for unsupervised outlier detection

Proximity-based methods

Proximity-based methods use distances and/or density estimation to define the outlier score of an observation: outliers are those points that are isolated from the remaining observations. Because these methods rely heavily on computing similarities or distances between observations, defining an appropriate distance metric is a critical step in this class of algorithms (see Section 1.3). Proximity-based methods are the most popular type of algorithm in the outlier detection literature, mainly due to their simplicity and their absence of assumptions about the underlying data distribution. Similarity is a relative concept that depends on the interpretation of proximity used; such proximity can be computed using three main approaches: nearest neighbors, densities, or clusters.

Clustering-based methods

Clustering-based methods for outlier detection (Eskin et al., 2002; Khoshgoftaar et al., 2005; Muller et al., 2012b; Ng & Han, 1994; Zhang et al., 1996) borrow the idea of cluster detection from the classification field: the aim is to detect dense groups of points by assigning each point to a specific cluster. The fit of a point to a cluster is usually measured by computing its distance to the centroid of every available cluster; an observation is then assigned to the cluster whose centroid is closest. Outliers are reported, in most cases, as a by-product of the process: they are the points that do not belong to any cluster, and their distances to the nearest centroid are used as outlier scores. The results of this kind of approach can vary between different runs of the algorithm depending on the initial placement of the clusters, and the number of clusters (k) is a user-specified parameter. Because outliers are reported as a by-product of the clustering process, clustering algorithms often fail to detect outliers that are grouped in small clusters. With prior knowledge about the outlying observations in the data it is possible to specify a convenient k that detects even outliers grouped in clusters; however, since outlier detection is essentially an unsupervised problem, the heuristic specification of k tends to produce suboptimal results. Multiple iterations of the algorithm, followed by a combination of their outputs, are therefore often needed to obtain more robust results.
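As a concrete illustration, the following is a minimal sketch of this scheme, assuming scikit-learn is available; the synthetic data, the choice k = 2, and all variable names are our own illustrative assumptions, not taken from the cited works:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense clusters plus two isolated points (synthetic data for illustration).
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(8, 1, (100, 2)),
               [[4, 12], [-6, 9]]])

# k is a user-specified parameter; a poor choice can hide clustered outliers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Outlier score: distance of each point to its nearest centroid.
distances = np.linalg.norm(X[:, None, :] - kmeans.cluster_centers_[None, :, :],
                           axis=2)
scores = distances.min(axis=1)

print(X[np.argsort(scores)[-2:]])  # the two most isolated points
```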

Nearest neighbor methods

Nearest neighbor methods are based on the measurement of distances between observations (Knox & Ng, 1998; Ramaswamy et al., 2000) using metrics such as the Euclidean distance (see Section 1.3). A user-specified parameter k determines the number of nearest neighbors to examine; outliers are the points with the highest scores, computed by averaging the distance of each point to its k nearest neighbors. In clustering methods, once the centroid of each cluster is established, the distance of a new instance is measured only relative to the centroids. Nearest neighbor methods instead compute distances between all instances in the data, offering a higher level of granularity than clustering-based methods. However, this richer granularity comes at the cost of high processing time: the pairwise distance between all observations in the data must be computed, yielding quadratic time complexity (O(n^2)). Different methods can be used to prune some points or portions of the space and thus reduce the number of distance computations needed (Angiulli & Pizzuti, 2002; Eskin et al., 2002; Wang et al., 2005).
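A minimal sketch of the average k-NN distance score, again assuming scikit-learn; k = 5 and the synthetic data are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 3)), [[6, 6, 6]]])  # one isolated point

k = 5  # user-specified number of neighbors
# k + 1 because each point is its own nearest neighbor at distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)

# Outlier score: average distance to the k nearest neighbors (excluding self).
scores = dist[:, 1:].mean(axis=1)
print(np.argmax(scores))  # index 200, the isolated point
```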

Combination functions

Different outlier detection algorithms produce scores that are not directly comparable, either because of the type of output produced or because of the scales of the scores. For example, ABOD (Kriegel et al., 2008), an outlier detection algorithm based on angles and distances between points, produces scores where higher values correspond to a lower outlier degree, whereas other algorithms, like LOF (Breunig et al., 2000), denote a higher degree of outlierness with higher values. These two examples produce values with different ranges and no upper bound; in contrast, one of the seminal algorithms for outlier detection, DB-outlier (Knox & Ng, 1998), produces scores limited to the range [0,1]. There is no consensus on which combination function is best: Keller et al. (2012) and Zimek et al. (2014) argue that an average of the outlier results is better, while Aggarwal (2013a) argues that a maximum function avoids the dilution of scores and instead highlights observations with high outlier scores even when they appear in only some of the ensemble members. In other words, Aggarwal (2013a) argues that the average function will dilute the outlier behavior of points whose outlierness is visible only in some subset of the scores.

Zimek et al. (2014) disagree, arguing that the main problem with the maximum function is that a single score far beyond the rest of the conglomerate of scores will determine the final decision, irrespective of the majority of scores holding a similar opinion. In our opinion, both approaches to score combination have pros and cons. In a dataset where all dimensions contribute to the classification of the outlier observations (absence of noisy attributes), the average function indeed provides an advantage by taking a consensus of, in theory, accurate and diverse results. If we instead consider outlier detection scenarios where the outlier behavior is buried deep inside the dataset and is visible only in a specific set of dimensions, then the maximum function would be better, as it is capable of retaining the single result based on the subset of dimensions where the outlier observation is visible. We conclude that, in spite of the promising behavior of the maximum function, in practice it is very difficult to identify the complete set of attributes in which a particular outlier is present; the most common scenario is to have different sets of outlier scores obtained from subsets of attributes that may or may not contain some of the relevant attributes. In that setting the average of scores makes more sense, as it compensates for the different errors through a consensus of imperfect but hopefully slightly accurate classifiers.
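To make the contrast concrete, the following minimal sketch uses hypothetical, already-normalized scores (rows are observations, columns are ensemble members):

```python
import numpy as np

# Hypothetical normalized scores from 4 ensemble members for 3 observations.
# Observation 2 is flagged strongly by a single member only.
scores = np.array([
    [0.10, 0.20, 0.15, 0.10],   # inlier
    [0.70, 0.65, 0.80, 0.75],   # outlier visible to all members
    [0.10, 0.15, 0.95, 0.10],   # outlier visible in one subspace only
])

avg = scores.mean(axis=1)  # consensus; dilutes observation 2 (about 0.33)
mx = scores.max(axis=1)    # highlights observation 2 (0.95), but a single
                           # spurious high score would dominate just the same
print(avg, mx)
```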

This problem with the interpretability of scores was brought to attention by Kriegel et al. (2011). While some of the earliest approaches, like DB-outlier, produced comprehensible scores in the range [0,1], variations on this basic approach, like LOF (Breunig et al., 2000) or ABOD (Kriegel et al., 2008), produce scores whose scale is unbounded, making it impossible, without expert knowledge, to determine the true outliers. An important task, therefore, is to transform the outlier scores of these types of algorithms into probability estimates, which provide better interpretability for the final user. The ideal outlier score is thus regular and normal (Kriegel et al., 2011): a score is regular if s(o) ≥ 0, with s(o) ≈ 0 for inliers and s(o) ≫ 0 for outliers. A variety of papers have discussed the problem of score comparability: using sigmoid functions to produce probability estimates as final output (Gao & Tan, 2006), simply transforming scores into standard deviations (Nguyen et al., 2010), or approximating the distribution of scores produced by different types of outlier detection algorithms and tailoring the transformation to a specific distribution (Kriegel et al., 2011). The basic principle of any transformation is that the ranking of the scores should not be inverted by it. Kriegel et al. (2011) propose several approaches for score regularization and normalization, where regularization first transforms an outlier score into the range [0, ∞) and normalization then brings the scores to the scale [0,1]. It is important to note that the approaches proposed in Kriegel et al. (2011) can be applied as a post-processing step to the scores produced by different outlier detection algorithms.
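As an illustrative sketch, the Gaussian scaling below is one of the normalizations discussed by Kriegel et al. (2011); the baseline-shift regularization and the sample scores are our own simplified assumptions:

```python
import numpy as np
from scipy.special import erf

def regularize(scores, baseline=0.0):
    """Shift scores so inliers sit near 0 and the range becomes [0, inf).

    For LOF, for instance, inliers score around 1, so a natural call is
    regularize(lof_scores, baseline=1.0)."""
    return np.maximum(0.0, scores - baseline)

def gaussian_normalize(scores):
    """Map regularized scores to [0, 1] via Gaussian scaling.

    erf is monotone, so the ranking of the scores is not inverted."""
    mu, sigma = scores.mean(), scores.std()
    return np.maximum(0.0, erf((scores - mu) / (sigma * np.sqrt(2))))

raw = np.array([1.0, 1.1, 0.9, 1.05, 3.5])  # hypothetical LOF-like scores
print(gaussian_normalize(regularize(raw, baseline=1.0)))
# inliers map to ~0, the outlying score maps close to 1
```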

Outliers located in low-dimensional projections

The dimensionality in which an outlier is located determines the level of complexity required of the outlier detector. A detector that limits its search space to single-dimensional analysis, ignoring the relationships between attributes, is able to detect only trivial outliers (Keller et al., 2012). In contrast, non-trivial or interesting outliers, the most challenging and critical type, are usually located in specific subspaces of the data: their outlier behavior is not exhibited in any single dimension but is revealed only in a specific combination of dimensions. An example is a 20-year-old patient with cancer (a typical outlier, as the combination of age and cancer is uncommon). Neither the age of 20 nor the presence of cancer is by itself uncommon enough to constitute an interesting outlier, so analyzing these attributes individually provides no insight into a potential outlying behavior. A simple extreme value detector, which assumes outliers are located in the tails of a distribution and focuses on individual, independent attributes, will fail to detect the 20-year-old cancer patient. An analysis of the same example using both features together, however, reveals that this combination of age and presence of cancer is sufficiently anomalous to be considered an outlier.
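The following minimal sketch mimics this example with synthetic data; the Mahalanobis distance is used here only as a convenient bivariate detector, not as a method proposed in the cited works:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
age = rng.uniform(15, 90, n)
# In this synthetic population the probability of cancer grows with age.
cancer = (rng.uniform(0, 1, n) < (age - 15) / 150).astype(float)
X = np.column_stack([age, cancer])
X = np.vstack([X, [20.0, 1.0]])  # the 20-year-old cancer patient

# Univariate view: per-attribute z-scores; neither value looks extreme.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print(z[-1])                     # both z-scores are modest

# Bivariate view: Mahalanobis distance captures the unusual combination.
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - X.mean(axis=0)
md = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
print(md[-1], md[:-1].mean())    # the patient stands out from the average
```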

There are multiple supervised solutions to this lower-dimensional problem; however, the lack of ground-truth labels in its unsupervised analog poses an interesting challenge. Moreover, the previous example depicted a scenario where the outlier behavior was observable in full dimensionality; in most real-world scenarios, interesting outliers are located in high-dimensional datasets and their outlier behavior is observable only in specific subspaces of the data. Thus, interesting outliers are located neither in individual dimensions nor in full dimensionality. High-dimensional data precludes a blunt search for outliers over all possible combinations of attributes, as the processing time increases exponentially with the dimensionality of the data. Besides dimensionality, the size of the dataset also plays an important role in determining the processing time of a detector; however, Filzmoser et al. (2008) argue that "computation time increases more rapidly with p than with n", where p stands for the dimensionality and n for the number of observations. A brute-force search through all possible subsets of attributes is infeasible: the number of subspaces to analyze is 2^d − 1. In low-dimensional data this does not represent a problem, but as the dimensionality increases the challenge becomes evident. For example, if d = 2 there are only 2^2 − 1 = 3 subspaces, but even at a relatively modest dimensionality of 10, the number of subspaces to analyze rises to 2^10 − 1 = 1023.
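Rather than enumerating all 2^d − 1 subspaces, a common workaround is to score each point in a small number of randomly sampled subspaces and combine the results, in the spirit of feature bagging (see Chapter 3). The sketch below is a minimal illustration under our own simplifying assumptions; the base detector, the subspace sampling range, and the crude normalization are all placeholders:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, k=5):
    """Average distance to the k nearest neighbors, used as base detector."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dist[:, 1:].mean(axis=1)

def feature_bagging(X, n_members=20, seed=None):
    """Score each point in randomly sampled subspaces and average.

    Each member draws a random subset of between d/2 and d-1 attributes,
    a commonly used sampling range for feature bagging."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    combined = np.zeros(n)
    for _ in range(n_members):
        size = rng.integers(d // 2, d)               # subspace dimensionality
        subspace = rng.choice(d, size=size, replace=False)
        s = knn_scores(X[:, subspace])
        combined += (s - s.mean()) / s.std()         # crude score normalization
    return combined / n_members

X = np.random.default_rng(3).normal(size=(300, 10))
X[0, [2, 7]] = 6.0                                   # outlier hidden in a 2-D subspace
print(np.argmax(feature_bagging(X)))                 # likely index 0
```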


Table of contents

INTRODUCTION
0.1 Outlier detection
0.2 Problem statement
0.2.1 Diversity of outliers
0.2.2 Hidden behavior of outliers
0.2.3 Impact of distance measures
0.3 Objective and contributions
0.4 Structure of thesis
CHAPTER 1 LITERATURE REVIEW
1.1 Outliers heterogeneity
1.1.1 Diversity of outlier detection algorithms
1.1.2 Outlier detection algorithms based on specific assumptions
1.1.3 Combination functions
1.2 Hidden outlier behavior
1.2.1 The challenges of high-dimensional data
1.2.2 Accuracy and diversity
1.2.3 Bias-variance trade-off
1.2.4 Ensembles for unsupervised outlier detection
1.3 Parameterization in outlier detection
1.3.1 Interaction algorithm – parameters – data
1.3.2 Evaluation methods
1.4 Current limitations
1.4.1 Limitation 1. Inadequacy of an outlier detector to identify different types of outliers
1.4.2 Limitation 2. Lack of computationally inexpensive approaches focused in the detection of outliers hidden in lower dimensional spaces
1.4.3 Limitation 3. Absence of a comprehensive study of the interaction parameter setting – dataset – outlier detection algorithm
CHAPTER 2 AN UNSUPERVISED APPROACH FOR COMBINING SCORES OF OUTLIER DETECTION TECHNIQUES, BASED ON SIMILARITY MEASURES
2.1 Introduction
2.2 Background and related work
2.3 The approaches
2.3.1 General approach
2.3.1.1 EDCV approach
2.3.1.2 EDVV approach
2.3.2 Putting it all together
2.4 Experiments and evaluation
2.4.1 Methods and parameters
2.4.2 Datasets
2.4.3 Results
2.5 Conclusions
CHAPTER 3 BAGGED SUBSPACES FOR UNSUPERVISED OUTLIER DETECTION
3.1 Introduction
3.2.1 Ensemble outlier detection
3.2.2 Feature bagging
3.2.3 Subsampling
3.3 Feature Bagged Subspaces for Outlier Detection (FBSO)
3.3.1 Lower dimensional spaces
3.3.2 Subsampling for density estimation
3.3.3 Feature bagged subspaces
3.4.1 Methods and parameters
3.4.2 Datasets
3.4.3 Results
3.4.3.1 Synthetic data
3.4.3.2 Real world data
3.4.4 Discussion
3.5 Conclusions and future works
CHAPTER 4 ON THE BEHAVIOR OF DISTANCE MEASURES ON UNSUPERVISED ENSEMBLE OUTLIER DETECTION
4.1 Introduction
4.2 Distance measures
4.3 Outlier detection algorithms
4.3.1 Assumptions about the data
4.3.2 Application domain
4.3.3 Availability of labeled data
4.3.4 Parameters required
4.3.5 Type of output
4.4 Outlier ensembles
4.5 Diagnostic tools
4.6 Evaluation
4.6.1 Methods
4.6.2 Datasets
4.6.2.1 Synthetic datasets
4.6.2.2 Real-world datasets
4.6.3 Results
4.6.3.1 Synthetic data
4.6.3.2 Real-world data
4.6.4 Discussion
4.7 Conclusions and future work
CHAPTER 5 GENERAL DISCUSSION
5.1 Detection of outliers using heterogeneous types of detectors
5.2 Detection of outliers in lower-dimensional spaces
5.3 Interaction of algorithmโ€™s parameters and data
CONCLUSION AND RECOMMENDATIONS
BIBLIOGRAPHY
