Characterizing pre-trained and task-adapted molecular representations
Abstract
Pre-trained deep learning representations have been successful in a wide range of predictive and generative tasks across different domains and input modalities. However, evaluating the emerging "zoo" of pre-trained models for various downstream tasks remains challenging. Our goal is to characterize the internal representations of pre-trained models to better inform data efficiency and sampling, robustness, and interoperability. We propose an unsupervised method that characterizes the embeddings of pre-trained models through the lens of non-parametric, group property-driven subset scanning (SS). While our method is domain-agnostic, we assess its detection capabilities through extensive experiments on diverse molecular benchmarks (ZINC-250K, MOSES, MoleculeNet), multiple predictive chemical language models (MoLFormer, ChemBERTa), and molecular graph generative models (GraphAF, GCPN). We further evaluate how representations evolve under domain adaptation via fine-tuning or low-dimensional projection. Our results show significant disentanglement in the learned space with respect to molecular structure and properties. Experiments also reveal notable information condensation in the pre-trained embeddings upon task-specific fine-tuning as well as low-dimensional projection. For example, among the top-$130$ most common elements in the embedding, only $8$ property-driven elements are shared between the two tasks, while the remaining $122$ are unique to each task. This work provides a post-hoc quality evaluation method for representation learning models and domain adaptation methods that is task- and modality-agnostic.