A system for understanding molecular formula patterns for non-stoichiometric, variable inorganic compounds
Abstract
Researchers in inorganic chemistry, especially those working on batteries, superconductors, or ceramics, often describe non-stoichiometric compounds in publications as patterns: formulas with variable subscripts and range limits that denote a set of compounds expected to share similar characteristics. Two examples are “SiLixOy wherein 0.05<x<0.7 and 0.9<y<1.1” and “LiNi1−xMnxO2x (where 0<x<1).” These patterns make it difficult to search the literature for specific element ratios, because the ranges they describe do not reduce to a single formula with whole-number subscripts. To begin expanding the discovery of such compounds, IBM created a system for parsing and indexing these non-stoichiometric molecular formulas. In a first step, the parser finds such patterns and their surrounding text in publications such as patents and articles. A second step breaks the patterns into atomic elements, subscripts, parentheses, lists, and range categories and produces a JSON object model suitable for indexing (e.g., with Solr/Lucene). The resulting index enables searching of molecular formulas and patterns, including queries over composition ranges for multiple elements, which can be combined with fast full-text search. The same indexing technique is also used to efficiently support searching of physical units, polymers, and tables. The system is designed for iterative improvement and is embedded within a larger accelerated discovery environment called CIRCA (Chemical Information Resources for Cognitive Analytics).
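To make the second step concrete, the following minimal Python sketch shows one way a formula pattern and its range clause might be broken into elements, subscripts, and ranges and serialized as JSON. It is an illustration only, not the system's actual parser or schema: the function names (`parse_formula`, `parse_ranges`, `to_document`), the field names in the output, and the deliberately small `ELEMENTS` table are all assumptions, and parentheses and element lists are not handled.

```python
import json
import re

# Assumption: a small illustrative subset of element symbols, not a full periodic table.
ELEMENTS = {"H", "Li", "C", "N", "O", "Al", "Si", "Ti", "Mn", "Fe", "Co", "Ni", "Cu"}

NUM = r"\d+(?:\.\d+)?"
# A subscript term is a number ("2"), a variable ("x"), or a scaled variable ("2x").
TERM = rf"(?:{NUM})?[a-z]|{NUM}"
# A subscript is one term, optionally followed by +/- terms, e.g. "1-x".
SUBSCRIPT = re.compile(rf"(?:{TERM})(?:[-+](?:{TERM}))*")
# A range clause such as "0.05<x<0.7" or "0<=x<=1".
RANGE = re.compile(rf"({NUM})\s*(<=?)\s*([a-z])\s*(<=?)\s*({NUM})")


def parse_formula(formula):
    """Split a flat formula pattern into element/subscript pairs."""
    # Normalize the Unicode minus sign and drop spaces left by text extraction.
    formula = formula.replace("−", "-").replace(" ", "")
    parts, i = [], 0
    while i < len(formula):
        two, one = formula[i:i + 2], formula[i:i + 1]
        # Prefer a two-letter symbol ("Li") over a one-letter symbol ("O").
        if two in ELEMENTS:
            symbol = two
        elif one in ELEMENTS:
            symbol = one
        else:
            raise ValueError(f"unrecognized element at position {i}: {formula[i:]!r}")
        i += len(symbol)
        match = SUBSCRIPT.match(formula, i)
        subscript = match.group(0) if match else "1"  # implicit subscript of 1
        if match:
            i = match.end()
        parts.append({"element": symbol, "subscript": subscript})
    return parts


def parse_ranges(text):
    """Extract per-variable numeric limits from the text that follows a formula."""
    ranges = {}
    for low, low_op, var, high_op, high in RANGE.findall(text):
        ranges[var] = {
            "min": float(low),
            "min_inclusive": low_op == "<=",
            "max": float(high),
            "max_inclusive": high_op == "<=",
        }
    return ranges


def to_document(formula, constraint_text=""):
    """Build a JSON-serializable object (hypothetical schema) for one pattern."""
    return {
        "pattern": formula,
        "parts": parse_formula(formula),
        "ranges": parse_ranges(constraint_text),
    }


doc = to_document("SiLixOy", "wherein 0.05<x<0.7 and 0.9<y<1.1")
print(json.dumps(doc, indent=2, ensure_ascii=False))
```

Checking two-letter symbols against a known element table before falling back to one-letter symbols is what lets the sketch read “Oy” as oxygen with the variable subscript y rather than as a single unknown symbol. An object of this shape could then be posted to a search index, with the numeric limits stored in range-queryable fields; how the real system maps its object model onto Solr/Lucene fields is not specified here.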