Dagdelen, John; Dunn, Alexander; Lee, Sanghoon; Walker, Nicholas; Rosen, Andrew S; Ceder, Gerbrand; Persson, Kristin A; Jain, Anubhav

doi:10.1038/s41467-024-45563-x

Structured information extraction from scientific text with large language models

2024

Published Web Location

https://doi.org/10.1038/s41467-024-45563-x

Creative Commons 'BY' version 4.0 license

Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
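
As a rough illustration of the "list of JSON objects" output format mentioned in the abstract, the sketch below parses a model completion into host/dopant pairs for the dopant-linking task. Everything specific here is an assumption made for illustration: the completion string is invented, the keys `host` and `dopants` are a hypothetical schema, and the helper names `parse_records` and `host_dopant_pairs` are not taken from the paper. The paper's actual schema and post-processing may differ.

```python
import json

# Hypothetical completion text, as a fine-tuned model might return it.
# The schema (keys "host" and "dopants") is an assumption for illustration only.
completion = (
    '[{"host": "ZnO", "dopants": ["Al", "Ga"]}, '
    '{"host": "TiO2", "dopants": ["Nb"]}]'
)


def parse_records(text):
    """Parse a completion into a list of dict records; return [] if malformed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only JSON objects; drop stray strings or numbers.
    return [item for item in data if isinstance(item, dict)]


def host_dopant_pairs(records):
    """Flatten records into (host, dopant) pairs, skipping incomplete records."""
    pairs = []
    for record in records:
        host = record.get("host")
        dopants = record.get("dopants", [])
        if not isinstance(host, str) or not isinstance(dopants, list):
            continue
        pairs.extend((host, d) for d in dopants if isinstance(d, str))
    return pairs


records = parse_records(completion)
pairs = host_dopant_pairs(records)
print(pairs)
```

Returning an empty list on malformed output, rather than raising, is one possible design choice: generated text is not guaranteed to be valid JSON, so a bulk extraction run can skip bad completions and continue.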

UC Berkeley
