This dissertation is a computational investigation of the task of locating and recognizing objects in unconstrained images in real time, and of learning to do so with minimal supervision. We take a probabilistic generative modeling approach, which involves formulating analytical models of several real-world vision problems, studying how optimal inference would proceed under such models, developing techniques for learning parameters under these models, and evaluating the performance of the optimal inference algorithms on realistic data. We begin by developing a novel generative model of images under which an image is a collection of sets of pixels that are generated by different object categories. This provides a novel definition of "object" as a set of pixels that are co-dependent, but conditionally independent of the other sets of pixels in the image. We then develop an algorithm for optimal inference (i.e., detection of objects) and maximum likelihood learning when the segmentation of training images is known. We point out a computational tradeoff between robustness of object detection and precision of localization, and propose context-dependent detectors as a way to solve the problem. These techniques are used to develop a state-of-the-art, real-time head, eye, and blink detector. We predict that similar context-dependent detectors may be found in the brain. Next, we develop an algorithm for optimal inference and maximum likelihood learning when the segmentation of training images is unknown. We test this algorithm on image datasets labeled with the identity but not the location of objects, and achieve state-of-the-art performance in the discovery of object categories. We then test the algorithm in a fully unsupervised context, in which a real-time person detector is learned from just a few minutes of visual information self-labeled through multi-modal contingency detection.
This suggests that early face (and other) preferences in human infants may be evidence for rapid statistical learning rather than innate biases. We develop software for learning robust, real-time object detectors from both labeled and unlabeled examples, including a real-time head, eye, and blink detector that is available to the public.