Maintaining Bernoulli samples over evolving multisets
Abstract
Random sampling has become a crucial component of modern data management systems. Although the literature on database sampling is large, there has been relatively little work on the problem of maintaining a sample in the presence of arbitrary insertions and deletions to the underlying dataset. Most existing maintenance techniques apply either to the insert-only case or to datasets that do not contain duplicates. In this paper, we provide a scheme that maintains a Bernoulli sample of an underlying multiset in the presence of an arbitrary stream of updates, deletions, and insertions. Importantly, the scheme never needs to access the underlying multiset. Such Bernoulli samples are easy to manipulate, and are well suited to parallel processing environments. Our method can be viewed as an enhancement of the "counting sample" scheme developed by Gibbons and Matias for estimating the frequency of highly frequent items. We show how the "tracking counters" used by our maintenance scheme can be exploited to estimate population frequencies, sums, and averages in an unbiased manner, with lower variance than the usual estimators based on a Bernoulli sample. The number of distinct items in the multiset can also be estimated without bias. Finally, we discuss certain problems of subsampling and merging that a rise in systems with limited memory resources or distributed processing, respectively. Copyright 2007 ACM.