Integrating Chemical Language and Physicochemical Features for Enhanced Molecular Property Prediction with Multimodal Language Models
Abstract
Machine learning has become a popular and efficient tool for predicting molecular properties in drug discovery and materials engineering. While traditional models were trained on predefined descriptors such as molecular fingerprints or geometric features, recent models focus on chemical language representations, which can also facilitate the design of molecules with specific desired characteristics. In molecular property prediction tasks, such as estimating the biodegradability of organic molecules or the toxicity of per- and polyfluoroalkyl substances (PFAS), the scarcity of labeled data is a major challenge that limits the effectiveness of supervised training. The cost of labeling molecules and the vastness of the space of plausible chemicals that require labeling further exacerbate this issue. There is therefore a growing need for molecular representation learning that generalizes to various property prediction tasks in an unsupervised or self-supervised setting. Chemical language-based machine learning has become a widely adopted approach for predicting molecular properties because of its efficiency and its ability to represent important structural features of molecules accurately. Recent advances in large transformer-based foundation models have demonstrated promising results in learning task-agnostic chemical language representations through pre-training on large unlabeled corpora and subsequent fine-tuning on downstream tasks of interest. Although pre-trained language models have emerged as viable options for predicting molecular properties, they are still at an early stage of development, and further research is needed to improve their performance and address limitations in generalization and sample efficiency.
Here we present a novel multimodal language model approach (MultiModal-MoLFormer) for predicting molecular properties, which combines chemical language embeddings derived from the recently introduced MoLFormer chemical language model with physicochemical features. Our approach employs a causal multi-stage feature selection method that selects physicochemical features based on their direct causal effect on the target property. Specifically, we use Mordred descriptors as physicochemical features and Markov blanket causal graphs as the inference algorithm to identify the most relevant features. Our results demonstrate that the proposed approach outperforms existing state-of-the-art algorithms, including the chemical language-based MoLFormer and graph neural networks, on complex tasks such as predicting the biodegradability of general compounds and estimating PFAS toxicity. Compared with the base MoLFormer approach, MultiModal-MoLFormer improves the classification accuracy for EPA categories of PFAS toxicity from 0.75 to 0.84. Additionally, the proposed approach achieves an accuracy of 0.94 on the biodegradability estimation task. These findings offer a promising direction for enhancing the accuracy and effectiveness of molecular property prediction. By incorporating both chemical language and physicochemical features, our approach has the potential to make significant contributions to drug discovery, materials science, and other fields that depend on accurate prediction of molecular properties.
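The fusion step described above, in which language-model embeddings are joined with a selected subset of physicochemical descriptors, can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: the arrays are synthetic placeholders rather than real MoLFormer embeddings or Mordred descriptors, the function names `select_by_abs_correlation` and `fuse_features` are hypothetical, and the selector is a simple correlation ranking used as a stand-in, not the Markov blanket causal procedure.

```python
import numpy as np


def select_by_abs_correlation(descriptors, target, k):
    """Stand-in selector (assumption): keep the k descriptor columns with the
    largest absolute Pearson correlation to the target. This is NOT the
    Markov blanket causal selection named in the abstract."""
    centered = descriptors - descriptors.mean(axis=0)
    t = target - target.mean()
    denom = np.linalg.norm(centered, axis=0) * np.linalg.norm(t)
    denom = np.where(denom == 0, np.inf, denom)  # constant columns score 0
    scores = np.abs(centered.T @ t) / denom
    top_k = np.argsort(scores)[::-1][:k]
    return np.sort(top_k)


def fuse_features(embeddings, descriptors, selected):
    """Standardize the selected descriptor columns and append them to the
    embedding matrix, giving one fused feature vector per molecule."""
    chosen = descriptors[:, selected]
    std = chosen.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (chosen - chosen.mean(axis=0)) / std
    return np.concatenate([embeddings, scaled], axis=1)


# Synthetic placeholders: 200 molecules, 8-dim embeddings, 5 descriptors.
rng = np.random.default_rng(0)
n = 200
embeddings = rng.normal(size=(n, 8))
descriptors = rng.normal(size=(n, 5))
# The synthetic target depends only on descriptor columns 1 and 3.
target = 2.0 * descriptors[:, 1] - 1.5 * descriptors[:, 3] + 0.1 * rng.normal(size=n)

selected = select_by_abs_correlation(descriptors, target, k=2)
fused = fuse_features(embeddings, descriptors, selected)
print(selected, fused.shape)
```

In a sketch like this, the fused matrix would then be passed to a downstream classifier or regressor; swapping the stand-in selector for a causal one would leave the fusion step unchanged.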