The bandit problem models a sequential decision process between a player and an environment. At the beginning, the player has no prior knowledge of the environment, but gains experience by sequentially pulling an arm from a set of arms and receiving rewards from the environment. The goal is to maximize the cumulative reward. In this dissertation, we focus on stochastic contextual bandit problems, where each arm is associated with a feature vector and the observed reward of each pulled arm follows a stochastic process. Bandit problems enjoy solid theoretical foundations in the literature and have found substantial applications in real-world tasks such as recommender systems, online advertising, and clinical trials. However, several challenges still hinder the success of bandit algorithms, including ensuring robustness, selecting hyper-parameters adaptively, improving computational efficiency, and adapting to non-stationarity.
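The interaction loop described above can be sketched in a few lines of code. The following is a minimal illustrative simulation of a stochastic linear contextual bandit with a LinUCB-style index policy; the feature distribution, noise level, and exploration parameter `alpha` are assumptions made for the sketch, not choices from this dissertation.

```python
# Illustrative sketch of the stochastic linear contextual bandit loop.
# Environment parameters, noise scale, and the LinUCB-style index are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon, alpha = 5, 10, 2000, 1.0

theta_star = rng.normal(size=d)           # unknown environment parameter
theta_star /= np.linalg.norm(theta_star)

A = np.eye(d)                             # ridge-regularized Gram matrix
b = np.zeros(d)                           # running sum of reward-weighted features
regret = 0.0

for t in range(horizon):
    X = rng.normal(size=(n_arms, d))      # arm feature vectors revealed this round
    theta_hat = np.linalg.solve(A, b)     # ridge estimate of theta_star
    A_inv = np.linalg.inv(A)
    # Index = estimated reward plus an exploration bonus (confidence width).
    ucb = X @ theta_hat + alpha * np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))
    a = int(np.argmax(ucb))               # pull the arm with the largest index
    reward = X[a] @ theta_star + 0.1 * rng.normal()  # stochastic reward
    A += np.outer(X[a], X[a])             # update sufficient statistics
    b += reward * X[a]
    regret += np.max(X @ theta_star) - X[a] @ theta_star

print(f"cumulative regret after {horizon} rounds: {regret:.1f}")
```

Because the exploration bonus shrinks as the Gram matrix grows, the per-round regret decays over time, so the cumulative regret grows sublinearly in the horizon.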
We focus on resolving these challenges, both theoretically and practically. In Chapter 1, we give a brief overview of bandit problems and the limitations of existing work. In Chapter 2, we propose a two-layer bandit algorithm to improve the robustness of stochastic linear contextual bandit algorithms under fully adaptive adversarial attacks. Based on a similar idea, in Chapter 3 we generalize the two-layer bandit structure into our proposed Syndicated Bandits framework, which aids the selection of multiple hyper-parameters in online contextual bandit problems, where offline tuning is impossible. In Chapters 4 and 5, we address the computational bottlenecks of bandit problems and propose two novel algorithms, SGD-TS and Sampling-MIPS, which improve computational efficiency in the estimation and prediction phases of contextual bandit algorithms; to the best of our knowledge, SGD-TS is the most efficient generalized linear bandit algorithm to date. In Chapter 6, we highlight the limitations of the stationarity assumption in classic bandit problems and propose a changepoint-detection-based bandit algorithm under a relaxed piecewise-stationary bandit environment. We conclude the dissertation and discuss future directions in Chapter 7. All of our proposed algorithms and frameworks enjoy solid theoretical guarantees, with proofs presented in the Appendix. Extensive experiments on both simulated and real datasets validate the effectiveness of our proposed methods.