Reinforcement learning (RL) has proven useful in a wide variety of applications, including robotics, autonomous vehicles, healthcare, finance, gaming, recommendation systems, and advertising. In general, RL involves training an agent to make decisions based on a reward signal. One of the major challenges in the field is the sparse reward problem, which occurs when the agent receives rewards only occasionally during training. Conventional RL algorithms are then difficult to train, because the agent does not receive enough feedback to learn the optimal policy. Model-based planning is one potential solution to the sparse reward problem, since it enables an agent to simulate its actions and predict their outcomes far into the future. However, planning can become computationally expensive or even intractable when many time steps must be simulated internally, because the number of candidate action sequences grows exponentially with the planning horizon (combinatorial explosion).
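To give a sense of scale for the combinatorial explosion, the following minimal sketch counts the nodes in an exhaustive lookahead tree. It is an illustration only: the function name `lookahead_nodes`, the branching factor of 4, and the depths shown are assumptions chosen for the example, not values taken from the thesis.

```python
def lookahead_nodes(branching_factor: int, depth: int) -> int:
    """Count the nodes in an exhaustive lookahead tree, excluding the root.

    Hypothetical helper for illustration: each simulated time step multiplies
    the number of candidate action sequences by the branching factor.
    """
    return sum(branching_factor ** d for d in range(1, depth + 1))


if __name__ == "__main__":
    # Assumed setting: 4 actions per step, increasing planning horizons.
    for depth in (5, 10, 20):
        print(f"depth={depth:2d}  nodes={lookahead_nodes(4, depth):,}")
```

With four actions per step, a horizon of 5 steps already requires 1,364 nodes, and a horizon of 10 steps requires about 1.4 million.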
To address these challenges, this thesis presents a new RL algorithm that uses a hierarchy of model-based (manager) and model-free (worker) policies to combine the strengths of both. The manager guides the worker by providing a goal or selecting a policy. The worker is computationally efficient and can respond to changes or uncertainty in the environment while carrying out its task. From the manager’s perspective, this arrangement abstracts away fine-grained state transitions, which reduces the depth required for tree search and greatly improves the efficiency of planning.
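The manager and worker division of labor can be sketched on a toy problem. The example below is not the algorithm proposed in the thesis; it is a minimal illustration under stated assumptions: a one-dimensional corridor, a manager that runs breadth-first search over coarse waypoints, and a hand-written reactive rule standing in for a learned model-free worker. All names (`manager_plan`, `worker_step`, `run_episode`, `stride`) are hypothetical.

```python
from collections import deque


def manager_plan(start, goal, stride, size):
    """Breadth-first search over waypoints spaced `stride` cells apart.

    Stand-in for a model-based manager. Assumes `start` and `goal` are
    multiples of `stride`. Returns the subgoals in order, excluding `start`.
    """
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path[1:]
        for nxt in (path[-1] - stride, path[-1] + stride):
            if 0 <= nxt < size and nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return []  # goal not reachable on the waypoint grid


def worker_step(position, subgoal):
    """Take one primitive step toward the subgoal.

    Placeholder for a learned model-free policy conditioned on the subgoal.
    """
    if position < subgoal:
        return position + 1
    if position > subgoal:
        return position - 1
    return position


def run_episode(start, goal, stride, size):
    """Let the manager choose subgoals and the worker execute them."""
    position, primitive_steps = start, 0
    subgoals = manager_plan(start, goal, stride, size)
    for subgoal in subgoals:
        while position != subgoal:
            position = worker_step(position, subgoal)
            primitive_steps += 1
    return position, len(subgoals), primitive_steps


if __name__ == "__main__":
    # The manager makes 3 decisions while the worker takes 12 primitive steps.
    print(run_episode(start=0, goal=12, stride=4, size=16))  # prints (12, 3, 12)
```

In this toy setting the manager searches to a depth of 3 rather than 12, which is the sense in which delegating fine-grained transitions to the worker shortens the tree search.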
The hierarchical agent was evaluated on two applications. The first is a maze navigation environment with continuous-state dynamics and a different randomly generated maze in each episode. These properties make the environment extremely challenging for both model-based and model-free algorithms. The agent's performance on the random maze task was evaluated on multiple platforms, including DeepMind Lab. For the second demonstration, the proposed algorithm was compared against other algorithms on the Arcade Learning Environment, a popular RL benchmark. Compared with state-of-the-art algorithms, the proposed hierarchical algorithm is shown to converge faster and to achieve greater sample efficiency on several tasks. Overall, the proposed hierarchical approach is a potential solution to the sparse reward problem and may enable RL algorithms to be applied to a wider range of tasks.