Natural language is one of the most common media through which people communicate, and natural language processing (NLP) is a subfield of artificial intelligence that enables computers to understand and reason about human language. NLP powers many applications in our daily lives, including Amazon Alexa, Apple Siri, and Google Translate. However, state-of-the-art NLP models often require a large amount of labeled data to learn. This hunger for labeled data poses challenges in many real-world applications, since collecting large-scale labeled data is expensive, time-consuming, and sometimes even impossible due to privacy restrictions.
In this dissertation, we aim to alleviate this labeled-data bottleneck in NLP by leveraging additional unlabeled and synthesized data. We begin by leveraging unlabeled data to improve model performance on natural language understanding tasks, showing that two major semi-supervised approaches, task-adaptive pre-training and self-training, are complementary and that their performance gains can be strongly additive. We then shift our focus to utilizing synthesized data to facilitate model learning. In this line of work, we first develop an algorithm that leverages a structured knowledge base (KB) to teach commonsense reasoning to pre-trained language models (PLMs): we use the KB to construct various logical forms and apply rules that convert these logical forms into multiple-choice question-answer pairs requiring commonsense logical reasoning, which we then use to refine PLMs. Next, we evaluate and augment the dialogue state tracking module, a core component of task-oriented dialogue systems. In particular, we train a PLM as a data generator and use it to produce additional labeled dialogue data for evaluating and augmenting state-of-the-art dialogue state tracking models. Finally, we target the reasoning gap between small language models (SLMs) and large language models (LLMs): we use LLM prompting to generate explanations, which serve as additional supervision signals for improving SLMs; SLMs are preferable in many real-world applications because of their low storage and compute costs.
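To make these ingredients concrete, the following sketches illustrate each of the four techniques in turn. First, a minimal self-training loop: scikit-learn classifiers stand in for PLMs, and the toy data and the 0.6 confidence threshold are illustrative assumptions rather than the dissertation's actual setup. Task-adaptive pre-training, the complementary half of the recipe, would further pre-train the PLM on in-domain unlabeled text before this step.

```python
# Minimal self-training sketch: a teacher trained on labeled data
# pseudo-labels unlabeled data, and confident pseudo-labels are added
# to the student's training set. (scikit-learn stands in for a PLM;
# the toy data and 0.6 threshold are illustrative assumptions.)
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["great movie", "terrible plot", "loved it", "boring film"]
labels = np.array([1, 0, 1, 0])
unlabeled_texts = ["an amazing, great experience", "a dull, boring mess"]

vec = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
X_l, X_u = vec.transform(labeled_texts), vec.transform(unlabeled_texts)

# 1) Train the teacher on the small labeled set.
teacher = LogisticRegression().fit(X_l, labels)

# 2) Pseudo-label the unlabeled examples; keep only confident ones.
probs = teacher.predict_proba(X_u)
confident = probs.max(axis=1) >= 0.6
pseudo = probs.argmax(axis=1)[confident]

# 3) Retrain a student on labeled plus confident pseudo-labeled data.
student = LogisticRegression().fit(
    vstack([X_l, X_u[confident]]), np.concatenate([labels, pseudo])
)
```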
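The KB-to-question construction can be sketched in a similar spirit. The triples, the template, and the distractor-sampling rule below are hypothetical stand-ins for the logical forms and conversion rules developed in the dissertation.

```python
# Hypothetical sketch: converting commonsense KB triples into
# multiple-choice questions with rule-based templates. Distractors are
# tails of other triples that share the same relation.
import random

triples = [
    ("dog", "CapableOf", "bark"),
    ("fish", "CapableOf", "swim"),
    ("bird", "CapableOf", "fly"),
]
TEMPLATES = {"CapableOf": "What is a {head} typically capable of?"}

def triple_to_mcq(triple, kb, n_distractors=2):
    head, rel, tail = triple
    pool = [t for _, r, t in kb if r == rel and t != tail]
    options = random.sample(pool, min(n_distractors, len(pool))) + [tail]
    random.shuffle(options)
    return {"question": TEMPLATES[rel].format(head=head),
            "options": options, "answer": tail}

print(triple_to_mcq(triples[0], triples))
# e.g. {'question': 'What is a dog typically capable of?',
#       'options': ['swim', 'bark', 'fly'], 'answer': 'bark'}
```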
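For the dialogue state tracking work, the key property of the synthesized data is that every generated utterance carries its state label, so the same data can both probe and augment a tracker. In the sketch below a template stands in for the trained PLM generator, and the MultiWOZ-style slot schema is a made-up example.

```python
# Hypothetical sketch of labeled dialogue-data generation for DST: each
# synthesized user turn comes paired with its dialogue-state label. A
# template stands in for the trained PLM; the slot schema is illustrative.
import json
import random

SLOTS = {
    "restaurant-area": ["north", "south", "centre"],
    "restaurant-food": ["thai", "italian", "chinese"],
}

def generate_labeled_turn():
    slot = random.choice(list(SLOTS))
    value = random.choice(SLOTS[slot])
    # In the dissertation a fine-tuned PLM generates the utterance;
    # a template keeps this sketch self-contained and runnable.
    utterance = f"I'd like a restaurant whose {slot.split('-')[1]} is {value}."
    return {"utterance": utterance, "state": {slot: value}}

print(json.dumps(generate_labeled_turn(), indent=2))
```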
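Finally, the explanation-based supervision for SLMs amounts to building (input, rationale-plus-answer) training records. The `llm_generate` placeholder, the prompt format, and the canned rationale below are assumptions; any LLM prompting interface could fill that role.

```python
# Hypothetical sketch: LLM-generated explanations become extra supervision
# for a small model. `llm_generate` is a placeholder for a real LLM call;
# the prompt format and canned rationale are illustrative assumptions.
def llm_generate(prompt: str) -> str:
    # Stand-in for an actual LLM prompting call.
    return ("Ice absorbs heat in the sun and melts above 0 degrees C, "
            "so it becomes liquid. Answer: water")

def build_distillation_record(question: str) -> dict:
    prompt = f"Q: {question}\nExplain your reasoning step by step, then answer."
    rationale = llm_generate(prompt)
    # Fine-tuning the SLM to emit the rationale and the answer turns the
    # explanation into an additional supervision signal.
    return {"input": question, "target": rationale}

print(build_distillation_record(
    "What does an ice cube left in the sun become?"))
```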
We conclude by summarizing the key findings of this dissertation on improving label efficiency via additional unlabeled and synthesized data, and by discussing possible future directions toward the goal of more label-efficient learning in NLP.