Learning from literature-extracted synthesis actions for organic synthesis
Abstract
With recent advances in computational models for retrosynthetic analysis, it is now possible to routinely propose reliable synthetic routes for target molecules of interest. However, carrying out the synthesis of such molecules has remained largely manual: both the formulation of adequate experimental steps and the actual synthesis in the laboratory rely on chemists’ knowledge and experience gathered over decades of practice. To further accelerate chemical discovery and enable the automated synthesis of any suggested synthetic route (defined in terms of reagents and intermediates), one must be able to determine, in an automated fashion, the sequence of operations necessary to execute any reaction step in the laboratory. Given the reaction knowledge accumulated in the literature over many decades, data-driven strategies provide a natural approach for this task. Nevertheless, such strategies require data in a machine-friendly format, which is not readily available in the literature: experimental procedures are usually reported in prose. Our recent work addresses these challenges in the following way. First, we design a transformer-based machine learning model to extract experimental actions from text. This model is pre-trained on a corpus of 2M sentences and associated actions obtained with a rule-based model, and then fine-tuned on a set of more than 2000 hand-annotated sentences. Second, the fine-tuned model is applied to experimental procedures from patents to generate a data set of 0.8M chemical equations and associated action sequences. Third, we use this data set to train another machine learning model that predicts experimental operations for arbitrary reactions given in SMILES format. Finally, we show how these machine learning models can be coupled with commercial chemical robots to synthesize molecules autonomously.
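To make the notion of a "chemical equation and associated action sequence" concrete, the following minimal Python sketch shows one way such a record could be represented. It is an illustration only: the class names `Action` and `ReactionRecord`, the action names (`Add`, `Stir`, `Concentrate`), their parameters, and the example reaction are all hypothetical and are not taken from the data set or the models described above. The only convention assumed is the standard reaction SMILES layout `reactants>agents>products`, with individual molecules separated by `.`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Action:
    """One laboratory operation (hypothetical schema, for illustration only)."""
    name: str
    params: dict[str, str] = field(default_factory=dict)


@dataclass
class ReactionRecord:
    """A reaction SMILES paired with an ordered list of actions."""
    reaction_smiles: str
    actions: list[Action]


def split_reaction_smiles(rxn: str) -> tuple[list[str], list[str], list[str]]:
    """Split 'reactants>agents>products' into three lists of molecule SMILES.

    Simplification: every '.'-separated fragment is treated as its own
    molecule, so multi-fragment species such as salts are not regrouped.
    """
    reactants, agents, products = rxn.split(">")

    def molecules(part: str) -> list[str]:
        return [m for m in part.split(".") if m]

    return molecules(reactants), molecules(agents), molecules(products)


def render(record: ReactionRecord) -> str:
    """Render the action sequence as a compact, human-readable string."""
    rendered = []
    for action in record.actions:
        if action.params:
            args = ", ".join(f"{k}={v}" for k, v in action.params.items())
            rendered.append(f"{action.name}({args})")
        else:
            rendered.append(action.name)
    return "; ".join(rendered)


# Made-up example: acid-catalysed esterification of acetic acid with ethanol.
record = ReactionRecord(
    reaction_smiles="CC(=O)O.CCO>OS(=O)(=O)O>CCOC(C)=O",
    actions=[
        Action("Add", {"material": "ethanol"}),
        Action("Add", {"material": "acetic acid"}),
        Action("Add", {"material": "sulfuric acid"}),
        Action("Stir", {"duration": "2 h", "temperature": "reflux"}),
        Action("Concentrate"),
    ],
)

reactants, agents, products = split_reaction_smiles(record.reaction_smiles)
print(reactants, agents, products)
print(render(record))
```

In this picture, the extraction model would fill the `actions` list from a prose procedure, while the prediction model would propose it from `reaction_smiles` alone. A structured form of this kind is also what a robotic platform would need as input, although the actual format used for that coupling is not specified here.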