Reinforcement Learning (RL), particularly Deep Reinforcement Learning (DRL), has demonstrated remarkable success across various domains, including robotics, Atari games, and the game of Go (e.g., AlphaGo). However, these achievements have largely been confined to simulated environments. Translating RL successes into practical, real-world applications remains challenging due to the need for strict constraint satisfaction and the difficulty of learning efficiently from limited real-world data.
In this dissertation, we discuss two key challenges that often arise in real-world RL: constraint satisfaction and sample efficiency. (1) Constraint Satisfaction: Real-world systems often impose constraints, such as resource or operational limits. These constraints may originate from the system itself (e.g., restricting the range of a robotic arm or maintaining minimum battery levels) or from the environment (e.g., avoiding dynamic obstacles or limiting vehicle speeds).
(2) Sample Efficiency: While powerful, RL algorithms tend to be data-hungry, often requiring millions of samples to perform effectively—an impractical demand for real-world systems where data collection is costly or infeasible.
To address the constraint satisfaction problem, we review existing approaches to constrained RL, categorizing constraints into two types: cumulative constraints and instantaneous constraints. We also discuss the challenges and opportunities in this field. Moreover, we propose a novel method, Interior-Point Policy Optimization (IPO), designed to handle cumulative constraints in Constrained Markov Decision Processes (CMDPs). IPO transforms the constrained optimization problem into an unconstrained one by incorporating logarithmic barrier functions as penalties in the RL objective, effectively accommodating the constraints. Beyond these theoretical contributions, we also formulate several real-world problems as CMDPs and solve them with constrained RL techniques, including resource allocation in 5G systems and constrained time-series smoothing. In practice, resource allocation in 5G is subject to constraints such as latency, average data rate, and quality-of-service (QoS) requirements, owing to the variety of services provided. In time-series smoothing, we impose restrictions on data correction to preserve as much local information as possible during the smoothing process. We carefully define the states, actions, rewards, and costs in accordance with each real-world application. IPO proves effective in addressing cumulative constraints, while instantaneous constraints are managed by introducing a projection layer at the end of the policy network, which projects infeasible actions onto the feasible action space at each decision step.
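To make the log-barrier construction concrete, a minimal sketch of the resulting unconstrained objective is given below; the symbols are introduced here only for illustration (J_R denotes the expected return, J_{C_i} the expected cumulative cost of the i-th constraint, d_i its limit, and t > 0 the barrier hyperparameter):

\[
\max_{\theta} \; J_{R}(\pi_\theta) \;+\; \sum_{i=1}^{m} \frac{1}{t}\,\log\big(d_i - J_{C_i}(\pi_\theta)\big)
\]

Intuitively, the barrier term tends to negative infinity as a cost approaches its limit, so gradient updates are steered away from the constraint boundary, and a larger t makes the barrier a tighter approximation of the hard constraint.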
To improve the sample efficiency of RL, we investigate uncertainty-driven exploration. In Adventurer: Exploration with BiGAN for Deep Reinforcement Learning, we propose a novelty-driven exploration strategy based on Bidirectional Generative Adversarial Networks (BiGAN). The BiGAN structure proves advantageous for estimating the novelty of complex, high-dimensional states, such as image-based states. Intuitively, a BiGAN that has been well trained on the visited states should only be able to accurately reconstruct or generate states from the distribution of visited states. By combining the pixel-level reconstruction error with the feature-level discriminator feature-matching error, this method refines state novelty estimation (a sketch of such a score is given after this paragraph). Building on this, we observe that most existing RL exploration algorithms underestimate uncertainty by focusing only on the local uncertainty of the next immediate reward. Thus, in Farsighter: Efficient Multi-step Exploration for Deep Reinforcement Learning, we propose a multi-step uncertainty exploration framework that explicitly controls the bias-variance trade-off of the value function estimation. Specifically, Farsighter considers the uncertainty over exactly k future steps and can adjust k adaptively. In practice, we learn a Bayesian posterior over the Q-function in discrete action spaces, and over actions in continuous action spaces, to approximate the uncertainty at each step, and we recursively deploy Thompson sampling on the learned posterior distribution. In our most recent work, Look Before Leap: Look-Ahead Planning with Uncertainty in Reinforcement Learning, we extend the idea of multi-step exploration to the model-based setting, proposing a novel framework for uncertainty-aware policy optimization with model-based exploratory planning. In the model-based planning phase, we introduce an uncertainty-aware k-step lookahead planning approach to guide action selection, striking a balance between model uncertainty and value function error. Simultaneously, the policy optimization phase employs an uncertainty-driven exploratory policy to gather diverse training samples, leading to improved model accuracy and overall RL agent performance.
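For concreteness, the following is a minimal sketch of how a BiGAN-based novelty score of this kind could be computed. The module names (E, G, D and the D.features accessor) and the PyTorch-style interface are assumptions made for illustration, not the exact implementation used in Adventurer.

import torch

# Sketch of a BiGAN-based novelty score (hypothetical module names).
# encoder E: state x -> latent code z; generator G: latent code z -> reconstructed state;
# discriminator D is assumed to expose an intermediate feature map via D.features(x, z).

def novelty_score(x, E, G, D, alpha=1.0, beta=1.0):
    """Estimate per-state novelty as a weighted sum of the pixel-level
    reconstruction error and the feature-level discriminator
    feature-matching error between x and its reconstruction."""
    with torch.no_grad():
        z = E(x)                     # inferred latent code for the observed state
        x_rec = G(z)                 # reconstruction of the state from its latent code
        # Pixel-level reconstruction error (mean absolute error per sample).
        rec_err = (x - x_rec).abs().flatten(1).mean(dim=1)
        # Feature-level error: distance between discriminator features of the
        # real pair (x, z) and the reconstructed pair (x_rec, z).
        f_real = D.features(x, z)
        f_rec = D.features(x_rec, z)
        feat_err = (f_real - f_rec).abs().flatten(1).mean(dim=1)
    return alpha * rec_err + beta * feat_err   # larger score = more novel state

Under this sketch, the score can be added to the environment reward as an intrinsic exploration bonus, so that states the BiGAN fails to reconstruct well receive larger bonuses and are visited more often.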