The rate of discovery of novel materials has accelerated in recent decades, yet the slow step in realizing these materials has long been synthesis design. Improving our understanding of the sequence of phases formed during a reaction, from precursors to target, and of how that sequence is affected by the synthesis conditions would give researchers a means to predict the conditions required to reach new desired outcomes. Because of the vast dimensionality of the synthesis design space, such a prediction task is well suited to data-driven methods: combining machine learning with experiment and computational modeling to both generate and test hypotheses that rationalize synthesis pathways. Driving hypothesis generation from data requires a substantial source of historical syntheses; in this thesis, we leverage the availability of synthesis procedures and characterized outcomes in the scientific literature.
Text mining of the scientific literature, using natural language processing as well as manual methods, has been extensively employed in materials science over the past decade, with applications ranging from systematic literature review to generating datasets of material properties to studies in synthesis science. Yet, in text mining for synthesis science, little attention has been paid to distinguishing syntheses by their outcome (e.g., final phase purity, particle morphology). The availability and subsequent analysis of such experimental outcome information would provide researchers with the data needed to develop models that hypothesize the effects of specific synthesis features on desired outcomes. Drawn from the available synthesis literature, such paired recipe-outcome data would be ample.
Progress in automatic text mining of the scientific literature continues steadily, particularly with the recent rise of generative large language models. However, pitfalls remain in the reliable extraction of materials synthesis recipes and in linking them to outcomes, making manual curation of such datasets more attractive in some cases. This thesis highlights both (1) advances in automatic methods for acquiring inorganic synthesis procedures and outcomes from the literature and (2) data-driven insights into synthesis science gleaned from synthesis datasets extracted manually from the literature, in combination with direct experiment and first-principles computations.
For task (1), we developed robust named entity recognition models for the extraction of synthesis procedure graphs as well as morphological outcomes for nanoparticle synthesis. To demonstrate the application of these methods, we constructed a large-scale text-mined dataset of gold nanoparticle synthesis recipes, which are plentiful in the literature and thus represent a rich source for data-driven synthesis design. Importantly, we also extract their morphological outcomes; including both the input synthesis conditions and the corresponding output makes this dataset valuable for data-driven synthesis science efforts.
For task (2), we focus on the extraction of phase purity outcomes for oxide materials. Impurity phases, in the form of remnant precursors or intermediate phases, offer clues to the synthesis pathway traversed in a reaction. BiFeO3 is an important multiferroic material whose synthesis is frequently reported in the literature and which has a strong tendency to form alongside competing impurity phases. We thus shift our focus in chemistry space to BiFeO3 for this second task. For this system, we demonstrate how text-mined datasets of such recipes and outcomes can be used to inform experiments and computational modeling that rationalize synthesis pathways.
In this thesis, we endeavor to improve the role of text mining, both automated and manual, in data-driven synthesis science. Our attention to extracting synthesis outcomes in addition to their corresponding procedures helps advance this subfield of materials science and paves the way for future efforts to accelerate both our understanding of synthesis mechanisms for existing materials and the discovery of new compounds.