Towards Multi-Modal Geospatial Foundation Models With Multiple Resolutions
Abstract
Foundation models are often pre-trained on large datasets and have been valuable in improving the efficiency of fine-tuning for language and visual processing tasks. Geospatial foundation models have been trained on satellite data sources such as the Landsat and Sentinel satellites, and have been applied to enhance the performance of various downstream tasks. Due to the nature of geospatial data, additional information can be gathered from other sources with overlapping temporal and spatial coverage, essentially providing multi-modal input data. Factors such as multiple resolutions, multiple timestamps, and missing data require careful consideration. While existing research has proposed multi-modal methods such as the Multi-Modal Masked Autoencoder (MultiMAE) for the natural image domain, there has been limited research on the design of multi-modal geospatial foundation models that can handle these factors. This research proposes a MultiMAE-inspired approach to pre-training a multi-modal geospatial foundation model with multi-temporal and multi-resolution data. SSL4EO, an Earth observation dataset with global coverage and multiple annual timestamps, is used for pre-training with temporal augmentation. The input modalities include Sentinel-1, Sentinel-2, land cover classification, elevation, and remote sensing indices, among others. Downstream evaluations are performed on the GeoBench benchmark datasets. The downstream effects of factors such as token selection, input modalities, and hyper-parameter tuning during the pre-training phase of geospatial foundation models are carefully evaluated.
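To make the pre-training objective concrete, the following is a minimal PyTorch sketch of MultiMAE-style multi-modal masked autoencoding, not the implementation used in this work: each modality is patch-embedded by its own input adapter, a random subset of tokens is kept visible across all modalities jointly, a shared transformer encodes the visible tokens, and per-modality heads reconstruct the masked patches. The modality names, channel counts, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalMAE(nn.Module):
    """MultiMAE-style masked autoencoder over several geospatial modalities."""

    def __init__(self, modalities, img_size=96, patch=8, dim=256, depth=4):
        super().__init__()
        self.patch = patch
        n = (img_size // patch) ** 2
        # One patch embedder per modality (MultiMAE-style input adapters).
        self.embed = nn.ModuleDict({
            m: nn.Conv2d(c, dim, kernel_size=patch, stride=patch)
            for m, c in modalities.items()
        })
        self.pos = nn.Parameter(torch.zeros(1, n, dim))   # shared per modality
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=depth)
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=1)
        # One pixel-reconstruction head per modality (output adapters).
        self.head = nn.ModuleDict({
            m: nn.Linear(dim, c * patch * patch) for m, c in modalities.items()
        })

    def forward(self, inputs, mask_ratio=0.75):
        names = list(inputs)
        b = next(iter(inputs.values())).shape[0]
        toks, tgts = [], []
        for m in names:
            x = inputs[m]
            toks.append(self.embed[m](x).flatten(2).transpose(1, 2) + self.pos)
            # Per-patch pixel targets matching the head output shape.
            p, c = self.patch, x.shape[1]
            t = x.unfold(2, p, p).unfold(3, p, p)         # B,C,H/p,W/p,p,p
            tgts.append(t.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p))
        toks = torch.cat(toks, 1)                         # B, M*n, dim
        n_total = toks.shape[1]
        n_per = n_total // len(names)
        # Jointly sample visible tokens across all modalities.
        keep = int(n_total * (1 - mask_ratio))
        perm = torch.randperm(n_total, device=toks.device)
        vis, hid = perm[:keep], perm[keep:]
        enc = self.encoder(toks[:, vis])
        # Decoder sees encoded visible tokens plus mask tokens elsewhere.
        dec_in = self.mask_token.expand(b, n_total, -1).clone()
        dec_in[:, vis] = enc
        dec_in = dec_in + self.pos.repeat(1, len(names), 1)
        dec_out = self.decoder(dec_in)
        # Reconstruction loss on masked patches only, per modality head.
        loss = 0.0
        for i, m in enumerate(names):
            sel = hid[(hid >= i * n_per) & (hid < (i + 1) * n_per)]
            if len(sel):
                pred = self.head[m](dec_out[:, sel])
                loss = loss + F.mse_loss(pred, tgts[i][:, sel - i * n_per])
        return loss


# Illustrative usage with assumed channel counts (e.g. 2 for Sentinel-1 VV/VH,
# 13 for Sentinel-2 bands, 1 for elevation):
mods = {"s1": 2, "s2": 13, "elevation": 1}
model = MultiModalMAE(mods)
batch = {m: torch.randn(4, c, 96, 96) for m, c in mods.items()}
loss = model(batch)
loss.backward()
```

In this sketch all modalities are co-registered at a single resolution; handling multiple resolutions, multiple timestamps, missing modalities, and categorical targets such as land cover (which would use a cross-entropy head rather than MSE) are exactly the design considerations the paper investigates.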