Training Data Curation for Language Models with Weak Supervision
- Mekala, Dheeraj
- Advisor(s): Shang, Jingbo
Abstract
Large language models (LLMs) are typically trained on millions of annotated samples, incurring substantial annotation costs and demanding considerable computational resources. While neural scaling laws show that test error decreases as a power law with training data volume, we are approaching the limits of feasibly collectible public data. This thesis investigates efficient alternatives through weak supervision. We explore two kinds of weak supervision: extractive and generative. Extractive weak supervision curates training data from unsupervised pools of data using weak supervision sources, while generative weak supervision leverages pre-trained language models to create synthetic data. We also propose assessing data quality along three key dimensions: diversity, difficulty, and correctness. Additionally, we empirically show that the training dynamics of LLMs provide valuable insights into data quality.
In the extractive weak supervision domain, we present a contextualized weakly supervised text classification framework that utilizes contextualized representations and user-provided seed words to interpret the corpus and derive labeled data. We also demonstrate that metadata can serve as an additional supervision source in our metadata-empowered weakly supervised classification framework.
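To make the seed-word idea concrete, the following is a minimal illustrative sketch, not the framework's actual implementation: it ignores contextualized representations and simply assigns each document to the class whose user-provided seed words it mentions most often. The class names, seed words, and tie-handling below are assumptions for illustration.

```python
from collections import Counter

# Hypothetical user-provided seed words per class (assumption, not from the thesis).
SEED_WORDS = {
    "sports": {"game", "team", "score", "coach"},
    "politics": {"election", "senate", "policy", "vote"},
}

def pseudo_label(document: str) -> str | None:
    """Assign the class whose seed words appear most often; None if no or tied evidence."""
    tokens = [t.strip(".,!?") for t in document.lower().split()]
    hits = Counter(
        {label: sum(tokens.count(seed) for seed in seeds)
         for label, seeds in SEED_WORDS.items()}
    )
    ranked = hits.most_common()
    if not ranked or ranked[0][1] == 0:
        return None  # no seed evidence: leave the document unlabeled
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous between classes: skip to avoid noisy labels
    return ranked[0][0]

corpus = [
    "The team celebrated after the coach praised the final score.",
    "The senate will vote on the new policy after the election.",
]
pseudo_labeled = [(doc, pseudo_label(doc)) for doc in corpus]
print(pseudo_labeled)
```

The full framework goes beyond this raw seed matching by additionally using contextualized representations to interpret the corpus, as described above.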
For generative weak supervision, we show how language models and publicly available question-answering datasets can be leveraged to generate text-classification data. We also discuss our work on generating synthetic tool-usage data from scratch with minimal human supervision.
Finally, we analyze training dynamics to understand data quality and investigate whether quality improvements can reduce quantity requirements. Through the lens of difficulty, diversity, and noise, we observe that extractive weak supervision tends to produce noisy data, while generative weak supervision creates less challenging data. To address these limitations, we propose learning-order-based selection to filter noisy data and learning-percentage-based selection to identify difficult examples. We also observe empirically that smaller language models can curate training data for larger language models.
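As a hedged illustration of learning-order-based selection (a sketch under assumptions, not the thesis's implementation), the snippet below records the first epoch at which the model fits each pseudo-labeled example and keeps only the earliest-learned fraction, following the premise above that noisy examples tend to be learned later. The `Example` structure, the user-supplied `train_one_epoch` and `predict` callables, and the `keep_fraction` value are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Example:
    id: int            # hypothetical fields; any example container would do
    text: str
    pseudo_label: str

def record_learning_order(
    model,
    dataset: List[Example],
    train_one_epoch: Callable,   # user-supplied: one pass of gradient updates
    predict: Callable,           # user-supplied: (model, text) -> predicted label
    num_epochs: int,
) -> Dict[int, int]:
    """Map each example id to the first epoch its pseudo-label was predicted correctly."""
    first_learned: Dict[int, int] = {}
    for epoch in range(num_epochs):
        train_one_epoch(model, dataset)
        for ex in dataset:
            if ex.id not in first_learned and predict(model, ex.text) == ex.pseudo_label:
                first_learned[ex.id] = epoch   # "learned" for the first time
    return first_learned

def filter_by_learning_order(
    dataset: List[Example],
    first_learned: Dict[int, int],
    keep_fraction: float = 0.5,    # assumed threshold; tune per task
) -> List[Example]:
    """Keep the earliest-learned examples; drop late- or never-learned (likely noisy) ones."""
    learned = sorted(
        (ex for ex in dataset if ex.id in first_learned),
        key=lambda ex: first_learned[ex.id],
    )
    return learned[: int(len(learned) * keep_fraction)]
```

The same bookkeeping, pointed at the other end of the ordering, gives a rough picture of how late-learned examples can surface the difficult cases targeted by the learning-percentage-based selection mentioned above.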
Together, this body of work advances our understanding of what constitutes "high-quality" training data while providing cost-effective solutions for data curation. Our theoretical analyses and empirical evaluations demonstrate significant performance improvements from these weak supervision approaches and targeted data selection methods. These approaches not only reduce resource requirements but also establish a framework for quantifying data quality across different weak supervision paradigms. By addressing the inherent limitations of both extractive and generative weak supervision, we provide a comprehensive methodology for curating training data that balances quality considerations with practical constraints, ultimately creating more efficient pathways for training robust language models.