Efficient parallel data mining for association rules
Abstract
In this paper, we develop an algorithm, called PDM, to conduct parallel data mining for association rules. Consider a transaction as a collection of items, and a large itemset is a set of items such that the number of transactions containing it exceeds a pre-specified threshold. PDM is so designed that the global set of large itemsets can be identified efficiently and the amount of inter-node data exchange required is minimized. Specifically, with a given database partition, each processing node will collect (count) information on each itemset from its local database efficiently via a hashing method. The information discovered by each node is next shared with other nodes via some communication schemes. Then, PDM employs a technique, called clue-and-poll, to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets for the exchange of count information among nodes, thus reducing the communication cost. The global set of large itemsets can hence be determined based on the aggregate count of itemsets. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance for various input patterns.