Data Leakage in Applying Geospatial Foundation Models
Abstract
The use of geospatial foundation models, pre-trained on massive unlabeled geospatial data through self-supervised learning, is increasingly recognized as a promising approach to improving performance on downstream machine-learning tasks. Example use cases include the classification, segmentation, and pixel-wise regression of satellite multispectral imagery to detect flooding, wildfire scars, and land use. For example, a supervised model may perform better when its parameters are initialized with weights pre-trained by masking-based self-supervised learning. This paper warns of the pitfall of so-called data leakage in applying geospatial foundation models. Data leakage occurs when the pre-training data contain information about the prediction target that will not be available when the downstream model is used for prediction. This leads to high measured performance on validation data that does not carry over to production.
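To make the failure mode concrete, the following minimal sketch (not from the paper; all tile footprints and identifiers are hypothetical) flags downstream validation tiles whose spatial extent overlaps the tiles seen during pre-training, one simple guard against this form of leakage.

```python
# Minimal sketch: flag spatial overlap between pre-training tiles and
# downstream validation tiles. Footprints are hypothetical
# (lon_min, lat_min, lon_max, lat_max) bounding boxes.

def boxes_overlap(a, b):
    """True if two (lon_min, lat_min, lon_max, lat_max) boxes intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

# Hypothetical footprints of tiles used to pre-train the foundation model.
pretraining_tiles = [
    (10.0, 45.0, 10.5, 45.5),
    (10.5, 45.0, 11.0, 45.5),
]

# Hypothetical footprints of the downstream validation tiles.
validation_tiles = [
    (10.2, 45.2, 10.7, 45.7),  # overlaps pre-training coverage: leakage risk
    (12.0, 46.0, 12.5, 46.5),  # spatially disjoint: safe
]

for i, val_tile in enumerate(validation_tiles):
    leaky = any(boxes_overlap(val_tile, pt) for pt in pretraining_tiles)
    print(f"validation tile {i}: {'LEAKAGE RISK' if leaky else 'ok'}")
```

In practice one would check real tile geometries (e.g., with shapely or geopandas) and consider temporal as well as spatial overlap, but the principle is the same: the downstream evaluation split must be disjoint from the pre-training coverage along whatever axes carry information about the target.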