On the Limits of Subsampling of Location Traces
Abstract
Location data is increasingly collected at a societal scale: examples include call detail records and data detail records held by telecommunication companies, GPS samples collected by car companies, and GPS samples from mobile devices gathered by mapping companies (e.g., Google, Microsoft). Such large-scale mobility datasets have applications in urban planning, network planning, surveillance, and real-time traffic estimation. This paper addresses the problem of subsampling location traces while preserving the information present in such datasets. We present a novel subsampling technique based on a hierarchical geographical encoding mechanism (geohash) that allows for efficient spatial cluster sampling. We analyze this technique using several information-theoretic measures that quantify the total amount of information in a dataset from a location-trace perspective, and we evaluate these measures on two large-scale mobility datasets from telecommunication companies: one of call detail records and one of data detail records. We show that subsampling the data by as much as 75% in both cases does not significantly reduce the total amount of information, i.e., the subsampled dataset can be used in much the same way as the original. This paves the way for more space- and CPU-efficient models that can support various applications reliant on collective location traces.
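To make the idea of geohash-based subsampling concrete, the following is a minimal illustrative sketch, not the procedure evaluated in this paper. It encodes points with the standard geohash scheme, groups them by geohash cell, keeps a fixed fraction of the points in every cell, and computes the Shannon entropy of the distribution of points over cells as one simple information measure. The function names (`geohash_encode`, `subsample_by_cell`, `cell_entropy`), the per-cell sampling rule, the choice of precision, and the choice of entropy as the measure are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    even = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def subsample_by_cell(points, keep_fraction, precision=5, seed=0):
    """Keep roughly keep_fraction of the points in each geohash cell.

    Hypothetical sampling rule: every non-empty cell keeps at least one
    point, so the set of visited cells is preserved.
    """
    rng = random.Random(seed)
    cells = {}
    for lat, lon in points:
        cells.setdefault(geohash_encode(lat, lon, precision), []).append((lat, lon))
    sample = []
    for members in cells.values():
        k = max(1, round(len(members) * keep_fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cell_entropy(points, precision=5):
    """Shannon entropy (in bits) of the distribution of points over cells."""
    counts = Counter(geohash_encode(lat, lon, precision) for lat, lon in points)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, comparing `cell_entropy(points)` with `cell_entropy(subsample_by_cell(points, 0.25))` gives a rough indication of how much of the spatial distribution survives when 75% of the records are dropped. Because geohash prefixes nest, lowering `precision` coarsens the cells without re-encoding the data.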