Kenneth L. Clarkson, Elad Hazan, et al.
Journal of the ACM
Pre-trained deep learning representations have been successful in a wide range of predictive and generative tasks across different domains and input modalities. However, evaluating the emerging "zoo" of pre-trained models for various downstream tasks remains challenging. Our goal is to characterize the internal representations of pre-trained models to better inform data efficiency and sampling, robustness, and interoperability. We propose an unsupervised method to characterize the embeddings of pre-trained models through the lens of non-parametric, group property-driven subset scanning (SS). While our method is domain-agnostic, we assess its detection capabilities with extensive experiments on diverse molecular benchmarks (ZINC-250K, MOSES, MoleculeNet), across multiple predictive chemical language models (MoLFormer, ChemBERTa) and molecular graph generative models (GraphAF, GCPN). We further evaluate how representations evolve under domain adaptation via fine-tuning or low-dimensional projection. Our results show significant disentanglement in the learned space with respect to molecular structure and properties. Experiments also reveal notable information condensation in the pre-trained embeddings after task-specific fine-tuning as well as after projection. For example, among the most common elements in the embedding, only the property-driven elements are shared between the two tasks, while the rest are unique to each task. This work provides a post-hoc quality evaluation method for representation learning models and domain adaptation methods that is task- and modality-agnostic.
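The abstract's core tool is non-parametric, property-driven subset scanning over embedding elements. Below is a minimal sketch of the general NPSS recipe (empirical p-values against a background activation set, a Berk-Jones scoring function, and the linear-time subset scanning ordering) applied to embedding columns; the function names, threshold grid, and NumPy implementation are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of non-parametric subset scanning (NPSS) over embedding
# activations: empirical p-values vs. a background set, Berk-Jones scoring,
# and the LTSS prefix-scan ordering. All names/shapes here are assumptions
# made for illustration, not taken from the paper's implementation.

import numpy as np


def empirical_pvalues(background, test):
    """Per-element empirical p-values of test activations vs. a background set.

    background: (n_bg, d) activations from reference inputs
    test:       (n_test, d) activations from the group under evaluation
    returns:    (n_test, d) p-values in (0, 1]
    """
    n_bg = background.shape[0]
    # p-value = fraction of background activations >= the observed value
    greater = (background[None, :, :] >= test[:, None, :]).sum(axis=1)
    return (greater + 1.0) / (n_bg + 1.0)


def berk_jones(n_alpha, n, alpha):
    """Berk-Jones scan statistic: N * KL(N_alpha / N || alpha), one-sided."""
    q = np.clip(n_alpha / np.maximum(n, 1), 1e-12, 1 - 1e-12)
    a = np.clip(alpha, 1e-12, 1 - 1e-12)
    kl = q * np.log(q / a) + (1 - q) * np.log((1 - q) / (1 - a))
    return n * np.where(q > a, kl, 0.0)  # score only an excess of low p-values


def scan_embedding_elements(pvals, alphas=(0.01, 0.05, 0.1, 0.25, 0.5)):
    """Find the most anomalous subset of embedding elements (columns).

    For each threshold alpha, the LTSS property lets us sort elements by the
    fraction of p-values <= alpha and scan only prefixes of that ordering.
    Returns the best score and the indices of the selected elements.
    """
    n_test, d = pvals.shape
    best_score, best_subset = -np.inf, np.array([], dtype=int)
    for alpha in alphas:
        n_alpha_per_elem = (pvals <= alpha).sum(axis=0)       # (d,)
        order = np.argsort(-n_alpha_per_elem / n_test)        # priority order
        cum_n_alpha = np.cumsum(n_alpha_per_elem[order])
        cum_n = n_test * np.arange(1, d + 1)
        scores = berk_jones(cum_n_alpha, cum_n, alpha)
        k = int(np.argmax(scores))
        if scores[k] > best_score:
            best_score, best_subset = scores[k], order[: k + 1]
    return best_score, best_subset
```

In this sketch, scan_embedding_elements would be run on p-values computed from, say, MoLFormer or ChemBERTa embeddings before and after fine-tuning, and the selected element subsets compared to gauge how many property-driven elements are shared versus unique across tasks.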
Yuankai Luo, Veronika Thost, et al.
NeurIPS 2023
Aditya Malik, Nalini Ratha, et al.
CAI 2024
Stephen Obonyo, Isaiah Onando Mulang’, et al.
NeurIPS 2023