The rapid advancement of language models has significantly reshaped the field of machine learning, enabling sophisticated applications across diverse domains. However, the effectiveness of these models is often contingent upon access to large-scale, high-quality datasets, which may not always be available. This thesis explores strategies for training language models in data-constrained scenarios, leveraging both model-generated data and rule-generated data to enhance model performance and generalization.
First, we investigate ELECTRA, a pretraining framework that improves the data efficiency of language model training by utilizing an “adversarial” model to corrupt the training data. Through theoretical and empirical analysis, we identify a critical optimization control problem in the original ELECTRA design, and propose a simple fix to boost its data efficiency further.
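To make the corruption-based objective concrete, the following toy sketch illustrates replaced-token detection, the idea underlying ELECTRA. The `corrupt` function here is a hypothetical stand-in: real ELECTRA uses a small masked language model as the generator, whereas this sketch samples replacements uniformly from the vocabulary.

```python
import random

def corrupt(tokens, vocab, mask_prob=0.15, rng=None):
    """Toy 'generator' step: replace a fraction of tokens with alternatives.

    Returns the corrupted sequence and per-token labels
    (1 = replaced, 0 = original) that a discriminator would
    be trained to predict, as in ELECTRA's replaced-token detection.
    """
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # In ELECTRA the replacement comes from a generator model;
            # uniform sampling is only a placeholder here.
            corrupted.append(rng.choice([v for v in vocab if v != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels
```

Because every token (not just masked positions) yields a training signal for the discriminator, this objective extracts more supervision per example than masked language modeling, which is the source of ELECTRA's data efficiency.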
Next, we study knowledge distillation, which improves model training by utilizing a “supportive” teacher model to refine the training data. Through theoretical analysis, we demonstrate the effectiveness of knowledge distillation even when the teacher model is trained on the exact same training set as the student model. Motivated by this understanding, we develop a student-oriented teacher training framework that optimizes the teacher model specifically to maximize student performance, rather than its own accuracy.
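For reference, the sketch below shows the standard temperature-scaled distillation objective (in the style of Hinton et al.), which student-oriented teacher training builds upon; it is an illustrative baseline, not the thesis's proposed framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    The temperature spreads probability mass over non-argmax classes,
    exposing the teacher's 'dark knowledge' about class similarities.
    """
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

By Gibbs' inequality this loss is minimized exactly when the student's softened distribution matches the teacher's, which is why the teacher's output distribution, and not just its accuracy, determines how useful it is to the student.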
Finally, we investigate rule-generated data. We focus on weakly supervised learning, which leverages heuristics to label datasets automatically and at scale, without human annotation. Through empirical analysis, we identify a key problem with training models on such rule-labeled data: the model is easily biased toward the simple heuristics used to annotate it. We design a simple method that avoids this bias during training and greatly improves the effectiveness of weakly supervised learning.
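As a minimal illustration of rule-generated labels, the sketch below applies hypothetical keyword heuristics to sentiment data and resolves them by majority vote; the rules and labels are invented for this example. The hazard the thesis addresses is visible here: a model trained on these labels can simply memorize the keywords rather than learn sentiment.

```python
def label_by_rules(text, rules):
    """Label text by majority vote over keyword heuristics that fire.

    Each rule is a (keyword, label) pair; rules that do not match
    abstain. Returns None when no rule fires.
    """
    votes = [label for keyword, label in rules if keyword in text]
    if not votes:
        return None  # abstain: no heuristic matched this example
    return max(set(votes), key=votes.count)

# Hypothetical heuristics for a toy sentiment task.
RULES = [
    ("refund", "negative"),
    ("broken", "negative"),
    ("love", "positive"),
    ("great", "positive"),
]
```

For example, `label_by_rules("i want a refund", RULES)` returns `"negative"` because the `"refund"` heuristic fires, while text matching no rule is left unlabeled.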
Together, these contributions advance our understanding of efficient language model training and provide practical solutions for scenarios with limited high-quality training data. Our theoretical analyses and empirical results demonstrate the importance of carefully controlling how models learn from generated and rule-based training signals.