Today's real-world databases typically contain millions of items with many thousands of fields. As a result, traditional distribution-based outlier detection techniques have become increasingly limited, and novel k-nearest neighbors based approaches have gained popularity. However, these k-nearest neighbors based methods are very sensitive to the value of k, may produce different rankings for the top n outliers, are computationally expensive for large datasets, and it remains doubtful whether they work well for high-dimensional datasets. To partially circumvent these problems, we propose in this paper a new global outlier factor and a new local outlier factor, together with an efficient outlier detection algorithm built upon them that is easy to implement and provides performance competitive with existing solutions. Experiments performed on both synthetic and real data sets demonstrate the efficacy of our method.
Aiming to discover observations that deviate so much from other observations as to arouse suspicions that they were generated by a different mechanism, outlier detection has become an important data mining task. It has been applied in many different fields, such as intrusion detection for cyber-security [2,3], fraud detection for credit cards, insurance and tax, early detection of disease outbreaks in the medical field, fault detection in sensor networks for monitoring health, traffic, machine status, weather, pollution and surveillance, and so on [7,8]. It has therefore generated enormous interest, and many techniques have been developed for this purpose in recent years, namely distribution-based, depth-based, distance-based, density-based and clustering-based approaches.
State-of-the-art k-nearest neighbors (kNN) based outlier detection algorithms, such as many distance-based and density-based methods, have demonstrated various ways to filter out the normal data and locate the small number of outliers. While they are simple to implement, other aspects of these algorithms are worth further exploration. Firstly, these methods usually return only the top n outliers, described by two values: the outlier factor (also referred to as the score in this paper) and the ranking of the points according to the scores. Consequently, different methods may produce different rankings for the top n outliers. Secondly, it has been observed that kNN-based outlier detection methods are sensitive to the parameter k: a small change in k can change the scores and, correspondingly, the ranking. As a result, except for very strong outliers whose scores are distinct, the ranking is sensitive to k as well. Thirdly, for modern large datasets with N data items, the O(N log N) running time involved in the exact search of the k-nearest neighbors has to be improved significantly. Finally, in high-dimensional space data points become almost equally close to each other, raising the so-called problem of the “curse of dimensionality”, and it is unclear whether these outlier definitions remain meaningful for high-dimensional data.
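The sensitivity to k can be seen with a toy example. The sketch below is our own illustration, not a method from this paper: using the distance to the kth nearest neighbor as the score (one common choice), the top-ranked point flips when k changes from 1 to 2, because two outlying points happen to lie close to each other.

```python
import math

def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (brute force)."""
    dists = sorted(math.dist(points[i], p) for p in points)
    return dists[k]  # dists[0] == 0.0 is the point itself

def top_outlier(points, k):
    """Index of the point with the largest k-th nearest neighbor distance."""
    return max(range(len(points)), key=lambda i: kth_nn_distance(points, i, k))

# A small cluster plus a pair of close-by outlying points.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),   # cluster
       (10.0, 10.0), (10.1, 10.0)]                       # outlying pair

print(top_outlier(pts, 1))  # a cluster point: each outlier's 1-NN is its twin
print(top_outlier(pts, 2))  # one of the outlying pair
```

With k = 1 each member of the pair "hides" behind its twin and a cluster point is ranked highest; with k = 2 the pair is correctly exposed.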
To meet these challenges and to explore the impact of dimensionality on kNN-based outlier detection algorithms, in this paper we propose a new minimum spanning tree (MST)-inspired k-nearest neighbors (kNN)-based outlier detection method that continues the work presented in  and is both computationally efficient and competitive with state-of-the-art outlier detection techniques. Basically, our method starts by finding the k-nearest neighbors of each data point. Then a mini MST is constructed over each data point and its k-nearest neighbors. Finally, a small number of outliers are identified within each mini MST using our proposed outlier scores. The method is based on the observation that one strength of MST clustering-based outlier detection methods is that they take the distances between data points in an MST (rather than the distances to the kth or k nearest neighbor(s)) into account when clustering, which makes the outlier scores relatively less sensitive to k. However, the construction of an exact MST requires quadratic running time and is computationally prohibitive for modern large datasets. Therefore, the aim of this hybridization (i.e., of MST clustering-based and kNN-based outlier detection) is to increase the robustness and consistency of the detection results and to significantly decrease the computation time of the MST clustering-based outlier detection process. Our first contribution in this paper is two newly proposed MST-inspired kNN-based outlier scores, a global one and a local one, and an outlier detection method developed upon them. Our second contribution is the significant time efficiency of the proposed method, achieved by applying an efficient approximate k-nearest neighbor search framework to our kNN-based outlier detection techniques. This is an advantage over the preliminary work of this research presented in , which requires the construction of an approximate minimum spanning tree very close to a true one.
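The pipeline just described can be roughly sketched in pure Python: find each point's k-nearest neighbors, build an MST over that local neighborhood with Prim's algorithm, and derive a score from the MST edge weights. The specific score used here (the largest MST edge incident to the query point, divided by the mean MST edge weight) is a hypothetical placeholder of our own; the actual global and local outlier factors are defined in Section 3.

```python
import math

def prim_mst(pts):
    """Edges (u, v, w) of a minimum spanning tree over pts (Prim's algorithm)."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(pts):
        w, u, v = min((math.dist(pts[u], pts[v]), u, v)
                      for u in in_tree for v in range(len(pts)) if v not in in_tree)
        edges.append((u, v, w))
        in_tree.add(v)
    return edges

def mini_mst_score(points, i, k):
    """Score points[i] inside the mini MST built on it and its k-nearest neighbors."""
    nbrs = sorted((j for j in range(len(points)) if j != i),
                  key=lambda j: math.dist(points[i], points[j]))[:k]
    local = [points[i]] + [points[j] for j in nbrs]   # query point has index 0
    edges = prim_mst(local)
    mean_w = sum(w for _, _, w in edges) / len(edges)
    incident = [w for u, v, w in edges if 0 in (u, v)]  # edges touching the query
    return max(incident) / mean_w

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
print(mini_mst_score(pts, 0, 3))  # close to 1: an ordinary cluster point
print(mini_mst_score(pts, 4, 3))  # well above 1: an isolated point
```

Because the score compares a point's incident MST edges with the other edges of the same mini MST, it reacts to all local inter-point distances rather than only the distance to the kth neighbor.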
Our third contribution is a study of a set of current outlier detection algorithms when applied to high-dimensional datasets. Finally, to be as general as possible, our algorithm imposes no specific requirements on the dimensionality of data sets and can be applied to outlier detection in large high-dimensional data sets. A number of experiments on both synthetic and real data sets demonstrate the robustness and efficiency of the proposed approach in comparison with several state-of-the-art outlier detection algorithms.
The rest of the paper is organized as follows. In Section 2, we review some existing work on state-of-the-art outlier detection approaches. We then present our proposed approaches in Section 3. In Section 4, an empirical study is conducted. Finally, conclusions are made in Section 5.
2. Related work
Related work in this paper falls into three main categories: distance- and density-based outlier detection methods, MST clustering-based outlier detection techniques, and outlier detection techniques for high-dimensional data.
2.1. Distance-based and density-based outlier detection
Many efforts have been devoted to detecting outliers. The distance-based outlier detection method was originally proposed by Knorr and Ng in 1998 as an improvement over distribution-based methods. Given a distance measure defined on a feature space, “an object O in a dataset T is a DB(p,D)-outlier if at least a fraction p of the objects in T lies greater than distance D from O”, where the term DB(p,D)-outlier is a shorthand notation for a Distance-Based outlier (DB-outlier) detected using parameters p and D. To suit different theoretical and practical purposes, two kNN-based variants have been developed: “Given two integers, n and k, Distance-Based outliers are the data items whose distance to their kth nearest neighbor is among the top n largest ones” (referred to as “DB-MAX” in the following), and “Given two integers, n and k, Distance-Based outliers are the data items whose average distance to their k-nearest neighbors is among the top n largest ones” (referred to as “DB” in the following).
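The two variants quoted above translate directly into code. The sketch below is our own illustration (the function names and the brute-force neighbor search are ours): DB-MAX scores a point by the distance to its kth nearest neighbor, DB by the average distance to its k nearest neighbors, and the n points with the largest scores are reported as outliers.

```python
import math

def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest neighbors (brute force)."""
    dists = sorted(math.dist(points[i], p) for p in points)
    return dists[1:k + 1]  # dists[0] == 0.0 is the point itself

def db_max_score(points, i, k):     # distance to the k-th nearest neighbor
    return knn_distances(points, i, k)[-1]

def db_score(points, i, k):         # average distance to the k nearest neighbors
    d = knn_distances(points, i, k)
    return sum(d) / len(d)

def top_n_outliers(points, k, n, score):
    """Indices of the n points with the largest scores."""
    return sorted(range(len(points)), key=lambda i: score(points, i, k),
                  reverse=True)[:n]

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
print(top_n_outliers(pts, 2, 1, db_max_score))  # [4]
print(top_n_outliers(pts, 2, 1, db_score))      # [4]
```

On this toy set both variants agree, but on data with ties or near-duplicate outliers their rankings can diverge, which is one source of the ranking inconsistency discussed in the introduction.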
Though simple and elegant, distance-based outlier detection techniques work well for data sets that contain one or more clusters of similar density, and they can detect globally oriented outliers. However, many real-world data sets have more complex structures. A classic situation illustrating this deficiency is shown in Fig. 1, where o1 and o2 are global outliers and can easily be detected by distance-based methods, while o3 is a local outlier and cannot.
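A configuration of the kind shown in Fig. 1 can be reproduced numerically. In the sketch below (the coordinates are our own illustration), a point sitting just outside a dense cluster receives a lower average-kNN-distance (DB) score than every member of a legitimate but widely spaced cluster, so a global distance-based ranking never surfaces it:

```python
import math

def db_score(points, i, k):
    """Average distance from points[i] to its k nearest neighbors."""
    dists = sorted(math.dist(points[i], p) for p in points)
    return sum(dists[1:k + 1]) / k  # dists[0] == 0.0 is the point itself

dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]       # tightly packed
sparse = [(20.0, 0.0), (25.0, 0.0), (20.0, 5.0), (25.0, 5.0)]  # widely spaced
local_outlier = (1.0, 1.0)  # clearly outside the dense cluster, yet close to it
pts = dense + [local_outlier] + sparse

print(db_score(pts, 4, 2))  # score of the local outlier, about 1.3
print(db_score(pts, 5, 2))  # score of a normal sparse-cluster point, 5.0
```

Every sparse-cluster member outranks the local outlier even though none of them is anomalous within its own cluster, which is the motivation for density-based, locally relative outlier scores.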