CASTELO: a combined machine learning and molecular modeling approach for drug discovery and protein-protein interaction optimization
Abstract
Lead optimization, which focuses on improving drug efficacy and selectivity, is one of the most expensive steps in pre-clinical drug discovery. We propose a combined machine learning and molecular modeling approach that partially automates the lead optimization workflow in silico and suggests hot spots for drug modification. Initial data are collected with physics-based molecular dynamics simulations, and contact matrices are calculated as the preliminary features extracted from these simulations. To take advantage of the temporal information in the simulations, we enhance the contact matrix data with a temporal dynamism representation, which is then modeled with an unsupervised convolutional variational autoencoder (CVAE). Finally, conventional and CVAE-based clustering methods are compared using metrics to rank the submolecular structures and propose potential candidates for lead optimization. Without requiring extensive structure-activity data, our method provides new hints for drug modification hot spots, which can be used to improve drug potency and reduce the overall cost of the lead optimization process. Beyond small-molecule hot-spot identification, we show that CASTELO can also be applied to protein-protein interactions, such as the binding of MHC and antigens. In this use case, CASTELO predictions agree well with experimental binding results and with binding affinities predicted by free energy perturbation. Our work supports the use of structure-based deep learning techniques in the design of antigen-specific immunotherapies.
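As an illustration of the feature-extraction step, the following minimal sketch computes one binary contact matrix per simulation frame from atomic coordinates. It is not the CASTELO implementation: the function name `contact_matrices`, the 8 Å distance cutoff, and the toy coordinates are assumptions chosen for the example, and the temporal dynamism representation and the CVAE are not shown.

```python
import numpy as np

def contact_matrices(coords, cutoff=8.0):
    """Return binary contact matrices for a trajectory.

    coords: array of shape (n_frames, n_atoms, 3), in angstroms.
    cutoff: distance threshold in angstroms (assumed value, not from the paper).
    Output: uint8 array of shape (n_frames, n_atoms, n_atoms) where entry
    [f, i, j] is 1 if atoms i and j are closer than `cutoff` in frame f.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise displacement vectors via broadcasting: (frames, atoms, atoms, 3).
    diff = coords[:, :, None, :] - coords[:, None, :, :]
    # Euclidean distance between every atom pair in every frame.
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff).astype(np.uint8)

# Toy trajectory: 2 frames, 3 atoms placed along the x axis.
frames = np.array([
    [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [20.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [12.0, 0.0, 0.0]],
])
maps = contact_matrices(frames)
print(maps.shape)
```

In a real workflow the coordinates would come from the molecular dynamics trajectory, and the resulting stack of matrices would serve as the input to the subsequent temporal enhancement and clustering stages.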