Catalysts Synthesis Procedures Extraction from Synthesis Paragraphs using Large Language Models
Abstract
Designing new catalysts is essential for the development of new sustainable processes. However, catalyst design most often involves trial and error, including repeating tests already performed in other labs and published in the literature. Transforming unstructured text from patents and scientific papers into organized procedures enables the use of machine learning techniques to drive catalyst design and optimization. One structured way of describing synthesis procedures is an action graph, where each node represents a synthesis action with its conditions. Using Natural Language Processing (NLP) techniques, one can process a synthesis procedure description to create an action graph. In particular, Large Language Models (LLMs) like OpenAI ChatGPT have proven able to perform various NLP tasks through fine-tuning, few-shot learning, or even zero-shot learning with prompt engineering, offering a fast and straightforward approach to extracting data from text. For example, Zheng et al.$^1$ applied prompt engineering techniques to gpt-3.5-turbo and gpt-4 to achieve an accuracy of 90-99% in extracting synthesis conditions for metal-organic frameworks (MOFs). Dunn et al.$^2$ fine-tuned a gpt-3 model with 500 prompts and responses to extract records of complex scientific materials chemistry. Bai et al.$^3$ applied schemas to control the response output of the LLM and extract data from complex chemistry tables. The majority of these approaches focus on the extraction of scientific parameters and rely on a single closed-source, proprietary model. In this work, we propose a pipeline using open-source LLMs to extract synthesis procedures from patents and scientific papers. We devised an optimized prompt to obtain structured action graphs from general synthesis procedures and evaluated the impact of various models on graph generation. Each action in the graph was defined following the definitions proposed by Vaucher et al.$^4$
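To make the notion of an action graph concrete, the following is a minimal illustrative sketch in Python of a data structure in which each node holds a synthesis action and its conditions. It is not the schema used in this work: the class names, the action names ("Add", "Stir", "Filter", "Dry"), the condition keys, and the example values are all hypothetical, and the graph is assumed to be a simple linear sequence of steps.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One node of the graph: a synthesis action plus its conditions."""
    name: str                                        # hypothetical action label
    conditions: dict = field(default_factory=dict)   # e.g. temperature, duration


@dataclass
class ActionGraph:
    """Actions stored in order, with edges linking consecutive steps."""
    actions: list = field(default_factory=list)
    edges: list = field(default_factory=list)        # (source index, target index)

    def add(self, name, **conditions):
        """Append an action and connect it to the previous one, if any."""
        self.actions.append(Action(name, conditions))
        idx = len(self.actions) - 1
        if idx > 0:
            self.edges.append((idx - 1, idx))
        return idx


def build_example():
    """Build a small made-up procedure (values are illustrative only)."""
    graph = ActionGraph()
    graph.add("Add", material="metal precursor solution")
    graph.add("Stir", duration="2 h", temperature="room temperature")
    graph.add("Filter")
    graph.add("Dry", temperature="100 C", duration="12 h")
    return graph


if __name__ == "__main__":
    for step in build_example().actions:
        print(step.name, step.conditions)
```

A representation of this kind keeps the conditions attached to the action they belong to, which is what allows a free-text paragraph to be compared against a reference procedure step by step.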