Enhancing Molecular Properties Prediction through Multi-Stage Causal Feature Selection
Abstract
Machine learning has become a powerful tool for accelerating scientific research. However, the reliability and trustworthiness of machine learning predictions depend on selecting relevant features for the models. In this work, we propose a systematic multi-stage approach for selecting molecular descriptors based on their causal relationship with a given property, which improves both the accuracy and the interpretability of machine learning predictions. The proposed multi-stage feature selection consists of four blocks: i) feature extraction using Mordred descriptors, ii) data cleaning, iii) causal feature selection within each Mordred descriptor module, and iv) general causal feature selection across the features retained from all modules. The Markov Blanket algorithm is used to construct the cause-effect graphs between the features and the target property, resulting in a sub-dataset of meaningful features together with their importance with respect to the target. To evaluate our approach, we selected two challenging tasks: predicting the toxicity and the biodegradability of chemical compounds from molecular descriptors. The results demonstrate that the proposed multi-stage approach outperforms state-of-the-art methods on both tasks while using significantly fewer features. The proposed methodology enhances the interpretability of machine learning predictions, making it easier for experts to identify the most relevant features and to understand the underlying mechanisms that govern the behavior of the studied molecules. This approach can be applied to a wide range of scientific problems, and we believe it will play a key role in advancing the field of machine learning in science.
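To make the staged structure concrete, the following is a minimal sketch of blocks ii) to iv) on synthetic data. It is an illustration under stated assumptions, not the authors' implementation. Block i) (Mordred descriptor extraction) is replaced by randomly generated columns. The abstract does not say which Markov Blanket variant is used, so a Grow-Shrink style search with a Fisher z test on partial correlations is swapped in here. The module names, column names, the threshold `z_crit`, and every function name are hypothetical.

```python
import math
import random

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(data, x, y, cond):
    """Partial correlation of x and y given cond (standard recursion)."""
    if not cond:
        return pearson(data[x], data[y])
    z, rest = cond[0], cond[1:]
    rxy = partial_corr(data, x, y, rest)
    rxz = partial_corr(data, x, z, rest)
    ryz = partial_corr(data, y, z, rest)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def dependent(data, x, y, cond, z_crit=3.0):
    """Fisher z test: True if x and y look dependent given cond."""
    r = partial_corr(data, x, y, list(cond))
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    n = len(data[y])
    return math.sqrt(n - len(cond) - 3) * abs(math.atanh(r)) > z_crit

def markov_blanket(data, candidates, target):
    """Grow-Shrink style estimate of the target's Markov blanket."""
    mb = []
    changed = True
    while changed:  # grow: add features dependent on target given mb
        changed = False
        for f in candidates:
            if f not in mb and dependent(data, f, target, mb):
                mb.append(f)
                changed = True
    for f in list(mb):  # shrink: drop features made redundant by the rest
        rest = [g for g in mb if g != f]
        if not dependent(data, f, target, rest):
            mb.remove(f)
    return mb

def clean(data, features):
    """Block ii: drop columns with missing values or zero variance."""
    kept = []
    for f in features:
        col = data[f]
        if any(v is None or math.isnan(v) for v in col):
            continue
        if max(col) == min(col):
            continue
        kept.append(f)
    return kept

def multi_stage_selection(data, modules, target):
    per_module = []
    for feats in modules.values():  # block iii: one search per module
        per_module += markov_blanket(data, clean(data, feats), target)
    return markov_blanket(data, per_module, target)  # block iv: global

# Synthetic stand-in for descriptor data (assumption, not real molecules).
rng = random.Random(0)
n = 400
data = {f"a{i}": [rng.gauss(0, 1) for _ in range(n)] for i in range(3)}
data.update({f"b{i}": [rng.gauss(0, 1) for _ in range(n)] for i in range(3)})
data["a_const"] = [1.0] * n  # removed by cleaning
data["b_copy"] = [x + rng.gauss(0, 0.5) for x in data["a0"]]  # noisy proxy of a0
data["y"] = [2 * a - b + rng.gauss(0, 0.3)
             for a, b in zip(data["a0"], data["b1"])]
modules = {
    "module_a": ["a0", "a1", "a2", "a_const"],
    "module_b": ["b0", "b1", "b2", "b_copy"],
}
selected = multi_stage_selection(data, modules, "y")
print(selected)
```

In this toy setup the target depends directly on `a0` and `b1` only. The column `b_copy` is correlated with the target through `a0`, so it can survive the per-module stage (where `a0` is not visible) and is expected to be dropped in the global stage once `a0` is in the conditioning set. This is the motivation for running a second, cross-module selection after the per-module one.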