Scene understanding is one of the holy grails of computer vision. Despite decades of research, it is still considered an unsolved problem, and the difficulty arises mainly from the huge space of possible images. We require models that capture this variability of scenes and their constituents (e.g., objects) within limited memory resources, and we require efficient learning and inference techniques that allow our models to find the optimal solution in the enormous space of possible interpretations.
In this thesis, we propose a set of novel techniques for object detection, segmentation, and contextual reasoning, taking a further step towards the ultimate goal of holistic scene understanding. In particular, we introduce a compositional method for representing objects and show that inference over an exponential number of objects can be performed in linear time. Subsequently, we develop a series of discriminative learning methods for object detection and segmentation and show that they achieve state-of-the-art performance on challenging benchmarks in the computer vision community. Finally, through a series of hybrid human-machine experiments, we identify bottlenecks in scene understanding to better guide future research efforts in this area.