Topology-Driven Completion of Chemical Data
Abstract
Efficient discovery of novel functional materials relies strongly on the development of strategies for exploring chemical space. In this contribution we describe an approach that helps to identify lacunae in existing data and to fill in the missing pieces in a targeted manner. The proposed approach consists of two main steps.
Step 1: We perform topological data analysis on the set of molecules. The molecular set is treated as a point cloud with pairwise distances defined using standard chemoinformatic approaches. We apply the Mapper algorithm [1] to produce a simplified description of the data in the form of a graph $G = (C, E)$, where $C$ is the set of clusters, represented as nodes, and $E$ is the set of edges. Each node represents a cluster of molecules, and an edge between two clusters indicates that the clusters overlap. The Mapper graph directly visualizes features of the shape of the data such as holes and branches (loops and flares on the Mapper graph, respectively), which indicate missing data. Our task, therefore, is to generate new molecules that fill in the loops (branches can be eliminated in a similar fashion). First, we search the dataset for the scaffolds that are most promising for molecule generation. Operationally, we prioritize scaffolds found in small clusters along large loops on the Mapper graph. For each cluster we identify the smallest loop $l$ (in terms of hops) using Dijkstra's algorithm [2], and then calculate the cycle length $w_l$ as the sum of the lengths of the edges in the loop. For the molecules in each cluster we calculate the Bemis-Murcko scaffolds $S = \{s_1, s_2, \ldots, s_n\}$. For each scaffold $s$ we introduce the generative potential $g_s = \sum_{c_i \in c_s} \mathrm{mean}(|C|)\, w_{l_{c_i}} / |c_i|$, where $c_s$ is the set of clusters in which the scaffold appears and $w_{l_{c_i}}$ is the cycle length for cluster $c_i$.
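The scaffold scoring in Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Mapper graph is given as a hypothetical dictionary `edge_len` mapping node pairs to edge lengths, `clusters` maps each cluster to the precomputed scaffold strings of its molecules, mean(|C|) is read as the mean cluster size, clusters that lie on no loop are assigned a cycle length of 0, and the normalization is taken to be min-max scaling. The function names are illustrative only.

```python
import heapq
from collections import defaultdict


def smallest_loop(adj, start):
    """Return the fewest-hop cycle through `start` as a node list, or None."""
    best = None
    for first in adj[start]:
        # Forbid the direct edge start-first, then search first -> start
        # with Dijkstra's algorithm using unit edge costs (hop counts).
        dist, prev = {first: 0}, {}
        heap = [(0, first)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == start:
                break
            if d > dist[u]:
                continue  # stale heap entry
            for v in adj[u]:
                if {u, v} == {start, first}:
                    continue
                if d + 1 < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + 1, u
                    heapq.heappush(heap, (d + 1, v))
        if start not in dist:
            continue  # no loop closes through this neighbor
        path, node = [start], start
        while node != first:
            node = prev[node]
            path.append(node)
        if best is None or len(path) < len(best):
            best = path
    return best


def loop_length(loop, edge_len):
    """Cycle length w_l: sum of edge lengths around the closed loop."""
    pairs = zip(loop, loop[1:] + loop[:1])
    return sum(edge_len[frozenset(p)] for p in pairs)


def generative_potential(clusters, edge_len):
    """Normalized g_s per scaffold (assumes min-max scaling to [0, 1])."""
    adj = defaultdict(set)
    for edge in edge_len:
        u, v = tuple(edge)
        adj[u].add(v)
        adj[v].add(u)
    # Assumption: mean(|C|) is the mean number of molecules per cluster.
    mean_size = sum(len(m) for m in clusters.values()) / len(clusters)
    g = defaultdict(float)
    for cid, scaffolds in clusters.items():
        loop = smallest_loop(adj, cid)
        w = loop_length(loop, edge_len) if loop else 0.0  # assumption: 0 off-loop
        for s in set(scaffolds):
            g[s] += mean_size * w / len(scaffolds)
    lo, hi = min(g.values()), max(g.values())
    span = (hi - lo) or 1.0
    return {s: (val - lo) / span for s, val in g.items()}


# Toy Mapper graph: clusters A, B, C form a loop; D is a flare attached to C.
clusters = {"A": ["s1"], "B": ["s1", "s2"], "C": ["s2", "s2", "s2"], "D": ["s3", "s3"]}
edge_len = {
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 2.0,
    frozenset(("C", "A")): 1.5,
    frozenset(("C", "D")): 1.0,
}
potentials = generative_potential(clusters, edge_len)
```

In this toy example, scaffold `s1` sits in the smallest clusters on the loop and receives the highest normalized potential, while `s3`, which only appears in the off-loop cluster, receives the lowest.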
A high $g_s$ value indicates that the scaffold appears in small clusters with a large cycle length, meaning that these clusters are part of bigger loops and the scaffold has a higher potential for generating novel molecules. We normalize the $g_s$ values to lie between 0 and 1.
Step 2: To generate novel molecules that complete loops on the Mapper graph, we adapt an existing graph generative model for scaffold-based molecular design [3], which is based on a variational autoencoder (VAE). Here, the input scaffold is extended by sequentially adding atoms and bonds, i.e., the generation is conditioned on the existing scaffolds, which ensures that all generated molecules contain the input scaffold. We modify the standard VAE loss function to take into account the generative potential of the scaffolds: $L = (1 - g_s)\,(L_r + L_{KL} + \alpha\,(g_s - g_{s_n}))$, where $\alpha$ is a hyperparameter between 0 and 1, $g_{s_n}$ is the generative potential of the scaffold of the newly generated molecule, $L_r$ is the reconstruction error between the input scaffold $s$ and the generated scaffold $s_n$, and $L_{KL}$ is the Kullback-Leibler divergence between the prior and the approximate posterior distribution. With this loss function we reduce the influence of the scaffolds with low generative potential and penalize the model when it generates molecules with low generative potential $g_{s_n}$. To calculate $g_{s_n}$ for a new scaffold, we include the newly generated molecule in the Mapper analysis after each iteration. We will discuss the application and performance of the described approach in the exploration of the space of photo-acid generators.
[1] Singh, G., Mémoli, F., & Carlsson, G. E. "Topological methods for the analysis of high dimensional data sets and 3D object recognition." SPBG (2007).
[2] Dijkstra, E. W. "A note on two problems in connexion with graphs." Numerische Mathematik 1.1 (1959): 269-271.
[3] Lim, J., Hwang, S. Y., Moon, S., Kim, S., & Kim, W. Y. "Scaffold-based molecular design with a graph generative model." Chemical Science (2020).
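The modified loss described in Step 2 can be illustrated numerically. The following is only a sketch that evaluates the stated formula on plain floats; the function name `weighted_vae_loss` and the example values are hypothetical, and in an actual training loop the reconstruction and Kullback-Leibler terms would presumably be tensors produced by the model rather than constants.

```python
def weighted_vae_loss(l_r, l_kl, g_s, g_sn, alpha):
    """Evaluate L = (1 - g_s) * (L_r + L_KL + alpha * (g_s - g_sn)).

    l_r   : reconstruction error between input and generated scaffold
    l_kl  : Kullback-Leibler divergence term
    g_s   : normalized generative potential of the input scaffold, in [0, 1]
    g_sn  : normalized generative potential of the generated scaffold, in [0, 1]
    alpha : hyperparameter between 0 and 1
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie between 0 and 1")
    return (1.0 - g_s) * (l_r + l_kl + alpha * (g_s - g_sn))


# Hypothetical values: same input scaffold, two different generated scaffolds.
loss_low_gsn = weighted_vae_loss(l_r=0.75, l_kl=0.25, g_s=0.5, g_sn=0.25, alpha=0.5)
loss_high_gsn = weighted_vae_loss(l_r=0.75, l_kl=0.25, g_s=0.5, g_sn=0.75, alpha=0.5)
```

With these example numbers, the generated scaffold with the lower potential (`g_sn = 0.25`) yields the larger loss, which matches the stated intent of penalizing generations with low $g_{s_n}$.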