Matheus Esteves Ferreira, Jaione Tirapu Azpiroz, et al.
ACS Fall 2022
Researchers in inorganic chemistry, especially those working on batteries, superconductors, or ceramics, often describe non-stoichiometric compounds in publications as patterns: formulas with variable subscripts and range limits that denote a set of compounds expected to share similar characteristics. Two examples are “SiLixOy wherein 0.05<x<0.7 and 0.9<y<1.1” and “LiNi1−x MnxO2x (where 0<x<1).” These patterns make it difficult to search the literature for specific element ratios, because the ranges they describe do not reduce to a single formula with whole-number subscripts. To begin expanding the discovery of such compounds, IBM created a system for parsing and indexing these non-stoichiometric molecular formulas. In a first step, the parser finds such patterns and their surrounding text in publications such as patents and articles. A second step breaks the patterns into atomic elements, subscripts, parentheses, lists, and range categories and produces a JSON object model suitable for indexing (e.g., with Solr/Lucene). The resulting index enables searching of molecular formulas and patterns, supporting multiple element composition ranges, which can be combined with fast full-text search. The same indexing technique is also used to efficiently support searching of physical units, polymers, and tables. The system is designed for iterative improvement and is embedded within a larger accelerated discovery environment called CIRCA (Chemical Information Resources for Cognitive Analytics).
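To make the second step more concrete, the following is a minimal Python sketch of how a pattern such as “SiLixOy wherein 0.05<x<0.7 and 0.9<y<1.1” might be broken into elements, subscripts, and ranges and serialized as JSON. It is not the IBM parser: the function names (`parse_ranges`, `parse_pattern`, `allows`), the JSON field names, and the small `ELEMENTS` set are all assumptions made for illustration. The sketch handles only numeric or single-letter subscripts, not expressions such as “1−x”, parentheses, or lists.

```python
import json
import re

# Illustrative subset of element symbols (assumption: a real system would use the full table).
ELEMENTS = {"Si", "Li", "O", "Ni", "Mn", "Co", "Fe", "Al", "Ti", "La", "Sr", "Cu", "Ba", "Y"}

# Matches constraints of the form "0.05<x<0.7".
RANGE_RE = re.compile(r"(\d*\.?\d+)\s*<\s*([a-z])\s*<\s*(\d*\.?\d+)")
# A subscript is either a number or a single lowercase variable letter.
SUBSCRIPT_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]")


def parse_ranges(text):
    """Map each variable name to its (lower, upper) bounds."""
    return {m.group(2): (float(m.group(1)), float(m.group(3)))
            for m in RANGE_RE.finditer(text)}


def parse_pattern(formula, constraints):
    """Split a flat formula into elements with fixed or ranged subscripts."""
    ranges = parse_ranges(constraints)
    elements = []
    pos = 0
    while pos < len(formula):
        # Prefer a two-letter symbol ("Li") over a one-letter symbol ("O").
        symbol = next((formula[pos:pos + n] for n in (2, 1)
                       if formula[pos:pos + n] in ELEMENTS), None)
        if symbol is None:
            raise ValueError(f"unknown element at position {pos} in {formula!r}")
        pos += len(symbol)
        match = SUBSCRIPT_RE.match(formula, pos)
        if match is None:
            # No explicit subscript means a count of one.
            subscript = {"type": "fixed", "value": 1.0}
        else:
            token = match.group(0)
            pos = match.end()
            if token.isalpha():
                if token not in ranges:
                    raise ValueError(f"no range given for variable {token!r}")
                low, high = ranges[token]
                subscript = {"type": "range", "variable": token, "min": low, "max": high}
            else:
                subscript = {"type": "fixed", "value": float(token)}
        elements.append({"symbol": symbol, "subscript": subscript})
    return {"pattern": formula, "elements": elements}


def allows(model, symbol, value):
    """Check whether a concrete subscript value is covered by the pattern."""
    for element in model["elements"]:
        if element["symbol"] == symbol:
            sub = element["subscript"]
            if sub["type"] == "fixed":
                return sub["value"] == value
            return sub["min"] < value < sub["max"]
    return False


model = parse_pattern("SiLixOy", "0.05<x<0.7 and 0.9<y<1.1")
print(json.dumps(model, indent=2))
```

In this sketch, each element entry carries either a fixed value or a minimum and maximum, which is the kind of structure a range-capable index could store as numeric fields. The `allows` helper mimics a composition query in memory, for example asking whether a lithium subscript of 0.5 falls inside the pattern.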