Learning with large models has driven unprecedented advances across diverse areas of machine learning. As a model's size grows, so does its capacity to memorize, or interpolate, the training data. Learning under interpolation presents new challenges and opportunities that are not addressed by classical statistical learning theory. In this thesis, we explore the performance of learning methods in the interpolation regime across various models, including linear models and neural networks. Our primary goal is to understand how data and model characteristics influence the convergence behavior of gradient-based methods such as gradient descent, and to quantify how well these models generalize to new data.
In the first part, we study linear models, the simplest setting in which learning under interpolation can be analyzed. In particular, we consider empirical risk minimization applied to high-dimensional generalized linear models and Gaussian mixture models. Our goal is to characterize the optimal test error of such models in an asymptotic setup where the data dimension is comparable to the number of training samples. By deriving a system of equations that precisely characterizes the test error, we obtain a lower bound on the test error that holds for any convex loss function and any ridge-regularization parameter. We then show the bound is tight by exhibiting a loss function and regularization parameter that achieve it. As a corollary, we approximately quantify the sub-optimality of least squares as a function of the data model.
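As a schematic illustration (the notation here is generic rather than the thesis's exact formulation), the ridge-regularized ERM estimators studied in this part take the form
\[
\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}\!\left(y_i,\, x_i^\top w\right) + \frac{\lambda}{2}\,\|w\|_2^2,
\]
where $\mathcal{L}$ is a convex loss and $\lambda \ge 0$ is the ridge parameter, analyzed in the proportional asymptotic regime $n, d \to \infty$ with $d/n \to \gamma \in (0, \infty)$.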
Continuing with linear models, we consider adversarial learning in high-dimensional Gaussian mixture models. Adversarial training, based on empirical risk minimization, is currently one of the main approaches for defending against adversarial attacks, i.e., small but targeted modifications of test data that cause misclassification. We derive precise asymptotic expressions for both the standard and the adversarial test error under $\ell_p$-bounded perturbations in a Gaussian mixture model. The resulting exact error formulas reveal the relationship between adversarial and standard errors, as well as the influence of factors such as the over-parameterization ratio, the data model, and the attack budget.
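Schematically, and again in generic notation, adversarial training under $\ell_p$-bounded perturbations of budget $\varepsilon$ solves a min-max problem of the form
\[
\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} \max_{\|\delta_i\|_p \le \varepsilon} \mathcal{L}\!\left(y_i\,(x_i + \delta_i)^\top w\right),
\]
where the inner maximization models the worst-case attack on each training sample.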
In the next part of the thesis, we extend our theoretical investigation to neural networks. Neural networks are known for their ability to memorize even complex datasets, often reaching near-zero training loss under gradient-descent optimization; despite this, they generalize remarkably well to new data. We investigate the generalization error (i.e., the gap between training and test errors) of neural networks trained with the logistic loss. Our main finding is that, under a specific data-separability condition, optimal test loss bounds are achievable with network width only poly-logarithmic in the number of training samples. Moreover, our analysis framework, which is based on algorithmic stability, yields improved generalization bounds and milder width requirements than prior work employing alternative tools such as uniform convergence via Rademacher complexity.
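For context, the standard notion underlying such stability arguments (due to Bousquet and Elisseeff; the thesis's exact variant may differ) is uniform stability: an algorithm $A$ mapping a dataset $S$ to a model $A(S)$ is $\epsilon$-uniformly stable if, for all pairs of datasets $S, S'$ differing in a single sample,
\[
\sup_{z} \; \mathbb{E}_A\!\left[\ell(A(S); z) - \ell(A(S'); z)\right] \le \epsilon,
\]
and $\epsilon$-uniform stability implies that the expected generalization gap is at most $\epsilon$.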
Next, in Chapter 5, we revisit the problem of learning two-layer neural networks in the interpolating regime, focusing on the role of large step sizes in accelerating training. Specifically, we consider the normalized gradient descent (NGD) algorithm, in which the step size is chosen inversely proportional to the loss. NGD has proven effective in accelerating convergence under exponentially-tailed loss functions, such as the exponential and logistic losses, particularly for linear classifiers on separable data. We show that for exponentially-tailed losses and two-layer neural networks, NGD attains a linear rate of convergence of the training loss to the global optimum, provided the iterates find an interpolating model. The analysis relies on a gradient self-boundedness condition and a log-Lipschitzness property that we establish. Additionally, we study the generalization of normalized GD for convex objectives via an algorithmic-stability analysis, proving finite-time generalization bounds which show that it does not overfit during training.
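Concretely, in generic notation, the NGD update on the training loss $L$ reads
\[
w_{t+1} = w_t - \frac{\eta}{L(w_t)}\,\nabla L(w_t),
\]
so the effective step size grows as the loss decays toward zero, which is what enables the acceleration over constant-step-size gradient descent in this setting.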
In the final part, we consider the decentralized learning scenario, in which the data is kept locally by several computing agents that communicate their parameters over a graph. We focus on decentralized learning in overparameterized settings, where models achieve zero training loss, and in particular on the behavior of decentralized gradient descent (DGD) on separable data. We provide new finite-time generalization bounds for DGD, extending prior results that focus predominantly on the centralized setting. Additionally, we develop improved gradient-based methods for decentralized learning with separable data, demonstrating speed-ups of several orders of magnitude over previous methods.
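In generic notation, with $m$ agents and a mixing matrix $W$ supported on the communication graph ($W_{ij} > 0$ only if agents $i$ and $j$ are neighbors), the standard DGD update at agent $i$ is
\[
w_i^{t+1} = \sum_{j=1}^{m} W_{ij}\, w_j^{t} - \eta\, \nabla L_i(w_i^{t}),
\]
where $L_i$ denotes agent $i$'s local empirical loss; each agent averages its neighbors' parameters and then takes a local gradient step.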
These results offer new insights and tools for understanding and improving learning in the interpolation regime across various model architectures and learning paradigms.