Mobile coverage maps consist of various key performance indicators, such as the received signal strength per location, and are of great importance to cellular operators. However, they are expensive to obtain, incomplete or inaccurate in some locations, imperfectly reflective of call quality outcomes, and potentially constructed from biased samples. In this dissertation, we develop a principled machine learning framework for predicting missing values of mobile coverage maps. It provides knobs for operators to express their objectives and preferences, as well as tools for data valuation.
First, we develop a prediction framework based on random forests (RFs) to improve signal strength maps from limited measurements. The proposed RFs-based predictor utilizes a rich set of features, including but not limited to location, time, cell ID, and device hardware, which are considered jointly for the first time. We show that our RFs-based predictor significantly improves the tradeoff between prediction error and the number of measurements needed compared to state-of-the-art data-driven predictors: it requires 80% fewer measurements for the same prediction accuracy, or reduces the relative error by 17% for the same number of measurements.
Second, we extend the framework beyond signal strength and mean square error (MSE) minimization to give operators knobs to (i) optimize prediction for coverage map quality outcomes, such as coverage indicators and call drop probability (CDP); and (ii) deal with sampling bias. We show that we can improve the relative error for the CDP by up to 32% in the high-CDP regime of greatest concern to cellular operators, which corresponds to improving signal strength prediction itself in its low-value regime. Similarly, we improve recall from 76% to 92% for predictions of coverage loss, where false negatives are costly to operators. We also introduce weight functions that allow operators to specify which points are more important to predict accurately. We propose a reweighting scheme to obtain unbiased error metrics in settings where the available signal strength data is not sampled proportionally to the target distribution of interest. We demonstrate a benefit of up to 20% from training models with reweighted errors in two intuitive cases: (i) uniform loss with respect to spatial area; and (ii) loss proportional to user population density. Combining both techniques yields an improvement of up to 5%.
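The reweighting idea can be illustrated with a minimal sketch, assuming a standard importance-weighting form in which each measurement is weighted by the ratio of the target probability of its spatial cell to the empirical sampling frequency of that cell. The cell labels, the toy signal strength values, and the function names `importance_weights` and `weighted_mse` are hypothetical and are not taken from the dissertation.

```python
from collections import Counter

def importance_weights(cells, target_prob):
    """Weight each sample by target_prob[cell] / empirical sampling frequency of its cell."""
    counts = Counter(cells)
    n = len(cells)
    return [target_prob[c] / (counts[c] / n) for c in cells]

def weighted_mse(y_true, y_pred, weights):
    """Weighted mean squared error, normalized by the sum of the weights."""
    total = sum(weights)
    return sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred)) / total

# Toy data (assumed): cell "A" is oversampled (3 of 4 points),
# while the target distribution is uniform over the two cells.
cells = ["A", "A", "A", "B"]
y_true = [-90.0, -92.0, -91.0, -110.0]   # signal strength in dBm
y_pred = [-90.0, -92.0, -91.0, -100.0]   # the only error is in cell "B"
uniform_target = {"A": 0.5, "B": 0.5}

weights = importance_weights(cells, uniform_target)
plain = weighted_mse(y_true, y_pred, [1.0] * len(cells))
reweighted = weighted_mse(y_true, y_pred, weights)
```

In this toy case the error in the undersampled cell B counts twice as much under the reweighted metric as under the plain MSE. A loss proportional to user population density would follow the same pattern, with `uniform_target` replaced by a hypothetical per-cell population share.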
Third, we apply, for the first time, the notion of data Shapley valuation in the context of mobile coverage map prediction. We demonstrate data valuation for various operator metrics, and we show how our reweighted errors fit naturally into the data Shapley framework. Assessing the data Shapley values of training data points enables improved prediction, data minimization, and pricing of mobile data. For instance, we are able to remove up to 65% of the low-valued training data points and simultaneously improve the recall of coverage loss from 64% to 99%.
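As a rough illustration of data Shapley valuation, the sketch below computes exact Shapley values for a tiny training set by averaging each point's marginal contribution to a utility over all orderings. It is a sketch under stated assumptions, not the dissertation's procedure: the mean predictor, the negative-MSE utility, the fallback prediction `default_pred` for an empty training set, and the toy values are all hypothetical stand-ins. Exact enumeration is only feasible for a handful of points; larger datasets would need a sampled approximation.

```python
from itertools import permutations

def utility(train_subset, validation, default_pred=-95.0):
    """Negative validation MSE of a mean predictor (assumed toy model).

    An empty training set falls back to default_pred (a hypothetical baseline).
    """
    if train_subset:
        pred = sum(train_subset) / len(train_subset)
    else:
        pred = default_pred
    return -sum((v - pred) ** 2 for v in validation) / len(validation)

def exact_data_shapley(train, validation):
    """Average marginal utility gain of each training point over all orderings."""
    n = len(train)
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        subset = []
        prev = utility(subset, validation)
        for i in order:
            subset.append(train[i])
            cur = utility(subset, validation)
            values[i] += cur - prev   # marginal contribution of point i
            prev = cur
    return [v / len(orderings) for v in values]

# Toy signal strength labels in dBm (assumed); the last point is an outlier.
train = [-90.0, -91.0, -60.0]
validation = [-90.0, -92.0]

values = exact_data_shapley(train, validation)
lowest = values.index(min(values))   # index of the least valuable training point
```

Here the outlier receives the lowest (negative) value, so removing it would improve the utility, which mirrors the idea of discarding low-valued training points. Swapping the utility for a reweighted error or a recall-based metric would change only the `utility` function.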
Throughout this thesis, we leverage two types of real-world mobile (LTE) datasets to evaluate our methods and gain valuable insights: the first was collected on our university campus by an Android app we developed, and the second was provided by a mobile crowdsourcing company for the NYC and LA metropolitan areas and includes approximately 11 million measurements. Our work can be useful for mobile analytics companies and cellular operators, particularly in the context of the upcoming 5G deployments.