Privacy-Aware Synthetic Data Generation and Knowledge Transfer for Machine Learning
- Chen, Dongjie
- Advisor(s): Chuah, Chen-Nee; Cheung, Sen-Ching
Abstract
The increasing adoption of machine learning systems across institutions and domains has highlighted critical challenges in data sharing and privacy protection. While sharing diverse labeled data from multiple sources helps machine learning models generalize, privacy concerns and regulatory requirements often restrict direct data sharing, particularly for sensitive multimedia data such as medical images and behavioral videos. Current approaches based on differential privacy (DP) or cryptography either require a significant trade-off between accuracy and privacy or support only limited post-processing.
This dissertation proposes novel frameworks for privacy-aware synthetic data generation and knowledge transfer in distributed machine learning systems. The work makes two major contributions. First, we introduce privacy-aware synthetic multimedia data generation techniques that protect data privacy before sharing while still enabling downstream tasks. Second, we develop a Reliability-based Curriculum Learning (RCL) framework that leverages pre-trained multimodal large language models (MLLMs) to enhance domain adaptation without requiring access to source data.
These frameworks offer considerable benefits for privacy-preserving machine learning, providing: (a) comprehensive privacy protection across the data transmission, computation, and human labeling stages, (b) improved synthetic data generation methods that are more stable and produce higher-quality outputs, and (c) effective adaptation strategies for knowledge transfer between domains without compromising sensitive information. This work enables faster, higher-quality data sharing with privacy protection, reduced annotation effort, and new opportunities for secure collaboration between institutions.