The eternal sunshine of the sketch data structure
Abstract
In the past years there has been significant research on developing compact data structures for summarizing large data streams. A family of such data structures is the so-called sketches. Sketches bear similarities to the well-known Bloom filters [B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of ACM, 13 (7) (1970), 422-426] and employ hashing techniques to approximate the count associated with an arbitrary key in a data stream using fixed memory resources. One limitation of sketches is that when used for summarizing long data streams, they gradually saturate, resulting in a potentially large error on estimated key counts. In this work, we introduce two techniques to address this problem based on the observation that real-world data streams often have many transient keys that appear for short time periods and do not re-appear later on. After entering the data structure, these keys contribute to hashing collisions and thus reduce the estimation accuracy of sketches. Our techniques use a limited amount of additional memory to detect transient keys and to periodically remove their hashed values from the sketch. In this manner the number of keys hashed into a sketch decreases, and as a result the frequency of hashing collisions and the estimation error are reduced. Our first technique in effect slows down the saturation process of a sketch, whereas our second technique completely prevents a sketch from saturating.1The phrase "eternal sunshine" in the title reflects that our techniques mitigate or halt the saturation process of a sketch.1 We demonstrate the performance improvements of our techniques analytically as well as experimentally. Our evaluation results using real network traffic traces show a reduction in the collision rate ranging between 26.1% and 98.2% and even higher savings in terms of estimation accuracy compared to a state-of-the-art sketch data structure. To our knowledge this is the first work to look into the problem of improving the accuracy of sketches by mitigating their saturation process. © 2008 Elsevier B.V. All rights reserved.