Mobile devices have access to rich personal and potentially sensitive data from online activity and multiple sensors, including personally identifiable information (PII) such as user identifiers, device identifiers, location, and health data. Mobile crowdsourcing (MCS) is a prevalent practice today: a large number of mobile devices upload measurements to a server, often including the location where they were collected. This data is used to provide various services (including spatiotemporal maps of cellular/WiFi coverage, sentiment, occupancy, and COVID-related information) but also poses privacy threats due to untrusted servers and/or third-party sharing. In this thesis, we first design and launch a user study on a real-world dataset from 400 apps, in order to better understand not just the extent of PII exposure, but also its context (e.g., functionality of the app, destination server, encryption used) and the risk perceived by mobile users today. Next, we consider two applications of learning from mobile crowdsourced data using the Federated Learning (FL) framework. FL improves privacy by allowing mobile devices to collaboratively train a global model while keeping their training data local and exchanging only model parameters with the server.
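To make the FL setting concrete, the following is a minimal sketch of federated averaging on a toy one-dimensional linear model. It is an illustration only, not the framework evaluated in this thesis: the function names, the synthetic data, the learning rate, and the number of rounds are all assumptions chosen for brevity.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few epochs of SGD on one device's private data (model: y = w*x + b)."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y      # prediction error on one local sample
            w -= lr * 2 * err * x      # gradient step for squared loss
            b -= lr * 2 * err
    return (w, b)

def federated_averaging(client_datasets, rounds=50):
    """Server loop: broadcast the global model, then average the returned parameters."""
    global_weights = (0.0, 0.0)
    total = sum(len(d) for d in client_datasets)
    for _ in range(rounds):
        # Each client trains locally; only parameters leave the device, never raw data.
        updates = [local_update(global_weights, d) for d in client_datasets]
        # Weight each client's parameters by its number of local samples.
        w = sum(u[0] * len(d) for u, d in zip(updates, client_datasets)) / total
        b = sum(u[1] * len(d) for u, d in zip(updates, client_datasets)) / total
        global_weights = (w, b)
    return global_weights

# Hypothetical setup: 5 devices, each holding 20 noise-free samples of y = 2x + 1.
random.seed(0)
clients = [
    [(x, 2.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(20))]
    for _ in range(5)
]
w, b = federated_averaging(clients)
print(f"learned w={w:.4f}, b={b:.4f}")
```

The server in this sketch never sees the `(x, y)` pairs, only the parameters each client returns, which is the property the paragraph above describes.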
First, we consider training classifiers that predict PII exposures and advertising requests in mobile data packets, and use them to block those packets on mobile devices. While such classifiers have previously been trained in a centralized way, we propose, for the first time, a federated packet classification framework, and we demonstrate its effectiveness in terms of classification performance and communication and computation cost via evaluation on three real-world datasets. Methodological challenges include model and feature selection and tuning of the federated learning parameters. We also design, for the first time, two privacy attacks based on HTTP features for an honest-but-curious server, and we demonstrate one mitigation approach, in which aggregating over a sufficient number of users limits the attack's effectiveness.
Second, we consider training mobile signal strength maps based on crowdsourced measurements. State-of-the-art approaches train centralized models that use location and other features to predict signal strength. In this work, we apply online federated learning to this problem, since mobile users move around and collect measurements over time. We consider a Deep Leakage from Gradients (DLG) attack, where an honest-but-curious server can infer information about an individual user’s trajectory based on that user's gradient updates. Such DLG attacks have previously been studied only for image/text data and are applied for the first time in this setting. We evaluate the effect of various FL parameters and show that averaging of gradients provides some protection against such DLG attacks in our setting. We also propose an algorithm that can further improve the privacy-utility tradeoff by selecting which data to include in a batch and use for local training.
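The intuition for why gradients can leak location, and why averaging helps, can be sketched with a deliberately simplified stand-in for DLG. The actual DLG attack reconstructs inputs by iterative optimization against the observed gradient; the example below instead assumes a linear signal-strength model with squared loss, for which a single-sample gradient reveals the input in closed form. The model, the coordinates, and the signal values are hypothetical.

```python
def batch_gradient(w, b, batch):
    """Mean squared-error gradient of a linear model over (location, signal) pairs."""
    n = len(batch)
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for (lat, lon), rss in batch:
        err = w[0] * lat + w[1] * lon + b - rss
        grad_w[0] += 2 * err * lat / n
        grad_w[1] += 2 * err * lon / n
        grad_b += 2 * err / n
    return grad_w, grad_b

def infer_location(grad_w, grad_b):
    """Server-side guess: for one sample, grad_w / grad_b equals the input location."""
    return (grad_w[0] / grad_b, grad_w[1] / grad_b)

w0, b0 = [0.0, 0.0], 0.0  # hypothetical current global model

# Batch of one measurement: the location is recovered (up to floating-point error).
single = [((33.64, -117.84), -80.0)]
recovered = infer_location(*batch_gradient(w0, b0, single))

# Batch of three measurements: the same ratio yields only an error-weighted
# blend of the locations, not any individual point.
points = [((33.64, -117.84), -80.0),
          ((33.65, -117.83), -95.0),
          ((33.66, -117.85), -70.0)]
blended = infer_location(*batch_gradient(w0, b0, points))

print("single-sample guess:", recovered)
print("averaged-batch guess:", blended)
```

In this toy case the averaged gradient still leaks a coarse location (a blend of the batch), which is consistent with averaging offering some, but not complete, protection; which points share a batch therefore affects what the server can infer.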