A relative patterns discovery for enhancing outlier detection in categorical data
Abstract
Outlier (also known as anomaly) detection is widely applied in areas such as disease diagnosis, credit evaluation, and cybercrime investigation. Recently, several studies based on frequent itemset mining (FIM) have been proposed to detect outliers in categorical data. For efficiency, these FIM-based studies prune (ignore) the majority of the data by imposing a threshold, restricting the length of the pattern, or both, and then evaluate observations using only the limited information that remains. Although efficient, such pruning suffers from distortion: accuracy drops to a low level of discernment, and in certain cases the judgment is even reversed. In this paper, we introduce the concept of relative patterns discovery from a new perspective on association analysis. To explore relative patterns efficiently, we devise a hash-index-based intersecting approach (called the HA). Based on the knowledge of relative patterns, we propose an unsupervised approach (called the UA) to evaluate which observations are anomalous. Because it does not rely on limited information, our method can differentiate the features of observations without the problem of distortion. An empirical investigation on eight real-world datasets from the UCI Machine Learning Repository demonstrates that our method generally outperforms previous studies in both accuracy and efficiency. We also demonstrate that our method executes efficiently, especially on high-dimensional data. Furthermore, our method can represent a natural panorama of the data, which is appropriate in controlled experiments for discovering more decisive factors in outlier detection.
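As a rough illustration of what hash indexing and intersecting can mean for categorical data, the sketch below maps each (attribute, value) item to the set of row ids that contain it and obtains the support of an itemset by intersecting those sets. It is a generic sketch under our own assumptions, not the HA itself: the function names `build_index` and `support`, the toy rows, and the use of Python sets as the index are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_index(rows):
    """Map each (column position, value) item to the set of row ids containing it."""
    index = defaultdict(set)
    for rid, row in enumerate(rows):
        for col, value in enumerate(row):
            index[(col, value)].add(rid)
    return index


def support(index, items, n_rows):
    """Fraction of rows that contain every item, found by intersecting row-id sets."""
    if not items:
        return 1.0
    # Start from the smallest set so intermediate intersections stay small.
    id_sets = sorted((index.get(item, set()) for item in items), key=len)
    common = set(id_sets[0])
    for ids in id_sets[1:]:
        common &= ids
        if not common:
            break  # no row contains all items; stop early
    return len(common) / n_rows


# Toy categorical data (hypothetical): colour, size, shape.
rows = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
index = build_index(rows)
```

With this index, the support of any combination of attribute values is available without rescanning the rows and without discarding low-support items in advance, which is the kind of unpruned information the abstract argues is needed to avoid distortion.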