About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
Big Data 2023
Short paper
Two-sample KS test with approxQuantile in Apache Spark
Abstract
The classical two-sample test of Kolmogorov-Smirnov(KS) is widely used to test whether empirical samples come from the same distribution. Even though most statistical packages provide an implementation, carrying out the test in big data settings can be challenging because it requires a full sort of the data. The popular Apache Spark system for big data processing provides a 1-sample KS test, but not the 2-sample version. Moreover, recent Spark versions provide the approxQuantile method for querying ϵ-approximate quantiles. We build on approxQuantile to propose a variation of the classical Kolmogorov-Smirnov two-sample test that constructs approximate cumulative distribution functions (CDF) from ϵ-approximate quantiles. We derive error bounds of the approximate CDF and show how to use this information to carry out KS tests. Psuedocode for the approach requires 15 executable lines. A Python® implementation appears in the appendix.