Text mining is a powerful approach for efficiently identifying and extracting information from large amounts of text. Applied to biomedical literature, it promises to let researchers process hundreds or thousands of articles in a way that was never possible before. This technology, coupled with the growing volume of published literature and open access literature sources, may usher in a new era of meta-research in which computational algorithms generate new scientific hypotheses from existing information.
Many obstacles, however, stand in the way of truly automated text mining. The difficulty of obtaining full-text literature, the specialized jargon of scientific research, and the ambiguity of biological entity terms are but a few of the challenges that need to be overcome. Even so, semi-automated text mining methods can greatly aid researchers by identifying topics of interest, reducing the number of articles to read, and targeting relevant information within articles. Used judiciously and consistently, semi-automated methods could have a great impact by efficiently distributing the task of manually reading and processing scientific literature.
Our contribution to semi-automated methods for biomedical text mining centers on the identification and extraction of point mutation information. Point mutations are an integral aspect of protein research, as they are the vehicle of diversity and the key to functional changes in proteins. They are also represented in a format that lends itself to text mining and can be cross-referenced against the growing number of biological sequence databases. This dissertation focuses on parsing literature for point mutations and extracting their functional effects. We show that, using statistical, graph theoretical, and machine learning methods, we can efficiently transform information that was previously embedded in the text into information that is computationally stored and processable.
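To illustrate why the notation for point mutations lends itself to text mining, the following minimal sketch assumes the common shorthand of a wild-type residue, a sequence position, and a mutant residue written in one-letter amino acid codes (for example, A123T). The pattern and the function name `find_point_mutations` are hypothetical and are not the method developed in this dissertation; they show only how a regular notation can be matched mechanically.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed notation: wild-type residue, 1-based position, mutant residue.
# This is an illustrative pattern, not the dissertation's extraction method.
MUTATION_PATTERN = re.compile(
    rf"\b([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples for candidate mentions."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in MUTATION_PATTERN.finditer(text)
        # A substitution must change the residue, so skip e.g. "A5A".
        if m.group(1) != m.group(3)
    ]
```

Calling `find_point_mutations` on a sentence such as "The A123T and G12D substitutions reduced activity." would yield one tuple per mention. A pattern this simple also matches unrelated tokens that happen to share the same shape, which is one instance of the entity ambiguity noted above, so any real pipeline would need further validation, for example against a reference sequence.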