Process trace clustering: A heterogeneous information network approach
Abstract
Process mining is the task of extracting information from event logs, such as ones generated from workflow management or enterprise resource planning systems, in order to discover models of the underlying processes, organizations, and products. As the event logs often contain a variety of process executions, the discovered models can be complex and difficult to comprehend. Trace clustering helps solve this problem by splitting the event logs into smaller subsets and applying process discovery algorithms on each subset, resulting in per-subset discovered processes that are less complex and more accurate. However, the state-of-the-art clustering techniques are limited: the similarity measures are not process-aware and they do not scale well to high-dimensional event logs. In this paper, we propose a conceptualization of process's event logs as a heterogeneous information network, in order to capture the rich semantic meaning, and thereby derive better process-specific features. In addition, we propose SeqPathSim, a meta path-based similarity measure that considers node sequences in the heterogeneous graph and results in better clustering. We also introduce a new dimension reduction method that combines event similarity with regularization by process model structure to deal with event logs of high dimensionality. The experimental results show that our proposed approach outperforms state-of-the-art trace clustering approaches in both accuracy and structural complexity metrics.