Abstract
In this paper, we will examine the problem of classification of massive graph streams. The problem of classification has been widely studied in the database and data mining community. The graph domain poses significant challenges because of the structural nature of the data. The stream scenario is even more challenging, and has not been very well studied in the literature. This is because the underlying graphs can be scanned only once in the stream setting, and it is difficult to extract the relevant structural information from the graphs. In many practical applications, the graphs may be defined over a very large set of nodes. The massive domain size of the underlying graph makes it difficult to learn summary structural information for the classification problem, and particularly so in the case of data streams. In order to address these challenges, we proposed a probabilistic approach for constructing an "in-memory" summary of the underlying structural data. This summary determines the discriminative patterns in the underlying graph with the use of a 2-dimensional hashing scheme. We provide probabilistic bounds on the quality of the patterns determined by the process, and experimentally demonstrate the quality on a number of real data sets. Copyright © SIAM.