Solid-state synthesis prediction is a key accelerator for the rapid design of advanced inorganic materials. However, determining synthesis variables such as the choice of precursor materials is challenging for inorganic materials because the sequence of reactions during heating is not well understood. To achieve predictive synthesis for a desired material, one potential approach is to learn synthesis design patterns from a large volume of experimental synthesis procedures. Nevertheless, a comprehensive, large-scale database of structured synthesis procedures for inorganic materials does not exist. Given the ability to convert unstructured text into structured information, decades of solid-state chemistry literature constitute a treasure trove of synthesis data. Therefore, this study aims to (1) develop natural language processing (NLP) algorithms to text-mine a large-scale inorganic synthesis dataset from the materials science literature, and (2) develop machine learning algorithms for precursor selection in solid-state synthesis based on the text-mined dataset.
Although many general-purpose NLP methods exist, text mining for inorganic synthesis requires dedicated development of models for information retrieval (Chapter 2). A major problem in developing a text-mining pipeline is identifying which materials in a synthesis paragraph are precursors and which are targets. In this study, we developed a two-step Chemical Named Entity Recognition (CNER) model that identifies precursors and targets based on the context around material entities. By integrating this model with the information retrieval models for other synthesis variables, we established a fully automated text-mining pipeline that extracts structured synthesis procedures from the literature. Applying the pipeline to 4,973,165 materials science papers, we extracted 33,343 solid-state synthesis procedures. The quality of the text-mined synthesis dataset is validated by an accuracy of 93% at the chemistry level, meaning that the target and precursor materials of an extracted reaction are consistent with the original literature report. This dataset for inorganic solid-state synthesis is currently the largest of its kind and paves the way toward the development of data-driven approaches for rational synthesis design.
Using the extracted data, we conducted a meta-analysis to study the similarities and differences between precursors in the context of solid-state synthesis (Chapter 3). To quantify precursor similarity, we built a substitution model that calculates the viability of replacing one precursor with another while retaining the target. Through hierarchical clustering of the precursors, we demonstrated that the "chemical similarity" of precursors can be extracted from text data without including any explicit domain knowledge. Quantifying the similarity of precursors offers a reference for suggesting candidate reactants when researchers alter existing recipes by replacing precursors. The capability of creating alternative recipes constitutes an important step toward developing a predictive synthesis model.
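As a rough illustration of the substitution idea, the following minimal sketch counts how often two precursors are swapped between recipes for the same target. It is not the substitution model developed in this study: the toy recipes, the function name `substitution_counts`, and the rule that two recipes must differ in exactly one precursor are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: (target, set of precursors) pairs.
recipes = [
    ("LiCoO2", frozenset({"Li2CO3", "Co3O4"})),
    ("LiCoO2", frozenset({"LiOH", "Co3O4"})),
    ("LiFePO4", frozenset({"Li2CO3", "FeC2O4", "NH4H2PO4"})),
    ("LiFePO4", frozenset({"LiOH", "FeC2O4", "NH4H2PO4"})),
    ("BaTiO3", frozenset({"BaCO3", "TiO2"})),
]


def substitution_counts(recipes):
    """Count precursor pairs that replace each other for the same target.

    Assumption: two recipes for one target that differ in exactly one
    precursor are treated as evidence of a viable substitution.
    """
    # Group the distinct precursor sets reported for each target.
    by_target = defaultdict(set)
    for target, precursors in recipes:
        by_target[target].add(precursors)

    counts = Counter()
    for precursor_sets in by_target.values():
        for a, b in combinations(precursor_sets, 2):
            only_a, only_b = a - b, b - a
            if len(only_a) == 1 and len(only_b) == 1:
                pair = frozenset({next(iter(only_a)), next(iter(only_b))})
                counts[pair] += 1
    return counts


counts = substitution_counts(recipes)
for pair, n in counts.items():
    print(sorted(pair), n)
```

Counts of this kind could be normalized into a pairwise similarity and passed to a hierarchical clustering routine; that step is omitted here.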
While the similarity of precursors enables the selection of alternative precursors, this approach is limited to existing materials. To learn which precursors to recommend for the synthesis of a novel target material, we further developed a representation learning model to evaluate the similarity of targets (Chapter 4). This data-driven approach learns the "chemical similarity" of target materials and bases the synthesis of a new target on precedent synthesis procedures for similar targets, mimicking how human researchers design syntheses. When proposing five precursor sets for each of 2,654 unseen test target materials, our recommendation strategy achieves a success rate of at least 82%. Our approach captures decades of heuristic synthesis data in a mathematical form, making it accessible for use in recommendation engines and autonomous laboratories.
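The recommendation step can be pictured as a nearest-neighbor lookup over target representations. The sketch below is a simplified stand-in and not the representation learning model of this study: it replaces learned representations with plain element-fraction vectors compared by cosine similarity, and the knowledge base `known`, the function `recommend`, and the query are hypothetical. It returns precedent recipes only; adapting their precursors to the elements of the new target is left out.

```python
import math

# Hypothetical knowledge base: target -> (element fractions, precursor set).
known = {
    "LiCoO2": ({"Li": 0.25, "Co": 0.25, "O": 0.5}, {"Li2CO3", "Co3O4"}),
    "LiNiO2": ({"Li": 0.25, "Ni": 0.25, "O": 0.5}, {"LiOH", "NiO"}),
    "BaTiO3": ({"Ba": 0.2, "Ti": 0.2, "O": 0.6}, {"BaCO3", "TiO2"}),
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(query_vec, known, k=5):
    """Return the k most similar known targets with their precursor sets."""
    ranked = sorted(
        known.items(),
        key=lambda item: cosine(query_vec, item[1][0]),
        reverse=True,
    )
    return [(name, sorted(precursors)) for name, (_, precursors) in ranked[:k]]


# Hypothetical unseen target: SrTiO3 as element fractions.
query = {"Sr": 0.2, "Ti": 0.2, "O": 0.6}
hits = recommend(query, known, k=2)
for name, precursors in hits:
    print(name, precursors)
```

In this toy setting the closest precedent for the query is BaTiO3, so its recipe would serve as the reference from which a precursor set for the new target is proposed.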
Overall, this study contributes a valuable large-scale synthesis dataset and interpretable precursor selection algorithms to the materials science community, representing a step forward in the prediction of solid-state synthesis.