Publication
ACS Fall 2024
Talk

MorganGen: Generative Modeling of SMILES Using Morgan Fingerprint Features

Abstract

This paper studies a method for generating SMILES representations of chemical compounds from hash-based substructure fingerprints, namely extended-connectivity fingerprints (ECFP). SMILES is a string-based representation of a molecular graph. Our generative model is conditioned on Morgan fingerprints, circular fingerprints based on extended-connectivity patterns, which encode molecular structural information. Potential applications include generating SMILES that contain the expected substructures encoded in a fingerprint, and steering SMILES generation through Morgan fingerprint space via Markov chain Monte Carlo. Using a transformer architecture, we encode Morgan fingerprints and decode them into valid and diverse SMILES sequences. We train the model on a database of 100M molecules and validate the approach through experiments across several datasets, which show that the model generates chemically valid compounds with diverse structures. The pretrained model is publicly available on Hugging Face, and the source code for training the model on ZINC data is released as open source.
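To make the idea of navigating fingerprint space with Markov chain Monte Carlo more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a fingerprint is stored as a set of on-bit indices, uses Tanimoto similarity to a hypothetical target fingerprint as the score, and proposes moves by flipping a single bit. The names `tanimoto`, `metropolis_walk`, `beta`, and the target bits are illustrative assumptions. In the workflow the abstract describes, each visited fingerprint would presumably be decoded into SMILES by the trained model, a step this sketch leaves out.

```python
import math
import random


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def metropolis_walk(start, target, n_bits=64, steps=2000, beta=20.0, seed=0):
    """Metropolis sampling over bit vectors (toy example, hypothetical scoring).

    The proposal flips one uniformly chosen bit, which is symmetric, so a move
    is accepted with probability min(1, exp(beta * (new_score - old_score))).
    Returns the best fingerprint visited and its score.
    """
    rng = random.Random(seed)
    current = frozenset(start)
    current_score = tanimoto(current, target)
    best, best_score = current, current_score
    for _ in range(steps):
        bit = rng.randrange(n_bits)
        proposal = current ^ {bit}  # toggle a single bit
        score = tanimoto(proposal, target)
        delta = score - current_score
        if delta >= 0 or rng.random() < math.exp(beta * delta):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score


# Hypothetical target fingerprint with four on-bits out of 64.
target = frozenset({3, 17, 42, 58})
best, score = metropolis_walk(start=frozenset(), target=target, n_bits=64)
print(sorted(best), round(score, 3))
```

Larger values of `beta` make the walk greedier, while smaller values let it explore more of fingerprint space; a real fingerprint would typically use far more bits than the 64 assumed here.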
