MS-CLIP: Multi-spectral Vision Language Learning for Earth Observation
Abstract
Recent Vision-Language Models (VLMs) have enabled a wide range of new tasks in the general vision domain, such as zero-shot classification and cross-modal retrieval. However, existing VLMs are limited to RGB data and do not leverage the full potential of multi-spectral satellite data. We apply continual pre-training to the CLIP model [1] to create a first-of-its-kind VLM that processes multi-spectral data, focusing on low-resolution satellite imagery from Sentinel-2. Our model, MS-CLIP, employs the dual-encoder architecture of CLIP with a patch embedding adapted for multi-spectral input. The model is trained with a contrastive objective that pulls the embeddings of matching image-text pairs together while pushing mismatched pairs apart. Additionally, we build a large-scale image-caption dataset with 900k multi-spectral samples from the SSL4EO-S12 dataset [2]. We develop a captioning pipeline that uses LLaMA3-LLaVA-NeXT [3] to automatically generate captions from the RGB channels and Overture Maps base-layer tags. A subset of the captions was assessed by domain experts to validate the synthetic data generation. Trained on this large-scale dataset, MS-CLIP demonstrates state-of-the-art performance on zero-shot EO tasks. The ViT-B/16 model reaches a zero-shot classification accuracy of 63% on EuroSAT, outperforming vanilla CLIP by over 10 percentage points, and text-to-image retrieval improves by 14 percentage points to 61% mAP@100. We plan to open-source the dataset and model weights. MS-CLIP can serve as a foundation for multi-spectral zero-shot segmentation models and multi-modal LLMs for Earth Observation that interpret satellite images beyond the visual spectrum and enable new applications.

References
[1] Radford, A. et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
[2] Wang, Y. et al. (2023). SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine.
[3] Li, B. et al. (2024). LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild.
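
The adapted patch embedding and the contrastive training objective can be sketched as follows. This is a minimal illustration, assuming a 13-band Sentinel-2 input and a ViT-B/16-style backbone; the names MultiSpectralPatchEmbed and clip_contrastive_loss are ours for illustration and do not describe the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpectralPatchEmbed(nn.Module):
    """Patch embedding whose input convolution is widened from 3 RGB
    channels to the Sentinel-2 band count (illustrative sketch)."""
    def __init__(self, in_bands=13, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_bands, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 13, H, W)
        x = self.proj(x)                      # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, D)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss: matching image-text pairs are
    pulled together, mismatched pairs in the batch are pushed apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```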
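
The captioning pipeline pairs the RGB channels of each sample with Overture Maps base-layer tags before prompting LLaMA3-LLaVA-NeXT. A hypothetical prompt template in this spirit is shown below; the exact instruction wording and tag handling are assumptions, not the authors' pipeline.

```python
def build_caption_prompt(overture_tags):
    """Assumed prompt template combining Overture Maps base-layer tags
    with an instruction for the captioning VLM."""
    tag_str = ", ".join(sorted(set(overture_tags)))
    return (
        "The attached Sentinel-2 RGB image covers an area with the following "
        f"map features: {tag_str}. Describe the visible land cover, land use, "
        "and structures in one or two sentences."
    )

# Example usage with illustrative tags:
print(build_caption_prompt(["forest", "river", "residential"]))
```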
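
Zero-shot classification with MS-CLIP follows the standard CLIP recipe: class names are turned into text prompts and each image embedding is matched to the nearest prompt embedding. The sketch below assumes a hypothetical model object exposing encode_image/encode_text and a callable tokenizer, which may differ from the released interface.

```python
import torch
import torch.nn.functional as F

# EuroSAT's ten land-cover classes, phrased as prompt-friendly names.
EUROSAT_CLASSES = ["annual crop", "forest", "herbaceous vegetation",
                   "highway", "industrial", "pasture", "permanent crop",
                   "residential", "river", "sea or lake"]

@torch.no_grad()
def zero_shot_classify(model, tokenizer, ms_images):
    """Return predicted class indices for a batch of (B, 13, H, W)
    Sentinel-2 tensors (interface is an assumption for illustration)."""
    prompts = [f"a satellite image of {c}" for c in EUROSAT_CLASSES]
    txt = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
    img = F.normalize(model.encode_image(ms_images), dim=-1)
    return (img @ txt.t()).argmax(dim=-1)
```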