Chen, Dongjie

This item is not available for download from eScholarship

Privacy-Aware Synthetic Data Generation and Knowledge Transfer for Machine Learning

2024

Abstract

The increasing adoption of machine learning systems across institutions and domains has highlighted critical challenges in data sharing and privacy protection. While sharing diverse labeled data from multiple sources helps machine learning models generalize, privacy concerns and regulatory requirements often restrict direct data sharing, particularly for sensitive multimedia data like medical images and behavioral videos. Current approaches based on differential privacy (DP) or cryptography are limited: they either require a significant trade-off between accuracy and privacy or support only limited post-processing.

This dissertation proposes novel frameworks for privacy-aware synthetic data generation and knowledge transfer in distributed machine learning systems. The work makes two major contributions. First, we introduce privacy-aware synthetic multimedia data generation techniques that protect data privacy before sharing while still enabling downstream tasks. Second, we develop a Reliability-based Curriculum Learning (RCL) framework that leverages pre-trained multimodal large language models (MLLMs) to enhance domain adaptation without requiring access to source data.

These frameworks offer considerable benefits for privacy-preserving machine learning, providing: (a) comprehensive privacy protection across the data transmission, computation, and human labeling stages, (b) improved synthetic data generation methods that are more stable and produce higher-quality outputs, and (c) effective adaptation strategies for knowledge transfer between domains without compromising sensitive information. This work enables faster, higher-quality data sharing with privacy protection, reduced annotation effort, and new opportunities for secure collaboration between institutions.

UC Davis

This item is under embargo until August 18, 2025.