Visual attention reflects the sampling strategy of the visual system. It is of great research interest not only because of its mysterious nature as a biological system, but also because of its potential benefits to computer vision and graphics. Psychologists have investigated visual attention for decades through psychophysical experiments such as visual search tasks, and sophisticated mathematical models have been built to account for the wide variety of human performance data. With the development of eye tracking systems, where people fixate while performing a task can be explicitly recorded, providing direct evidence of what people attend to. In recent years, computational models have emerged rapidly that take complex images and videos as input and generate saliency maps predicting what attracts people's attention. In particular, there is a trend toward building principled statistical models with explicit optimization goals. However, a canyon seems to separate these two lines of research, although both seek to better understand visual attention. Visual search models are typically designed for well-controlled stimuli with distinct targets and distractors, and are not applicable to complex images and videos. Saliency algorithms, on the other hand, are not supported by theories that can account for the variety of human data in visual search.

In this dissertation, we develop a visual attention theory from first principles. Our goal is a framework that combines the virtues of both visual attention models and saliency algorithms. We address the following issues to achieve this goal:

(1) We develop a Bayesian framework of saliency by considering what the visual system is trying to optimize when directing attention. Bottom-up saliency emerges naturally as the self-information of visual features. Unlike existing saliency measures, which depend on the statistics of the particular image being viewed, our measure of saliency is derived from natural statistics. The Bayesian framework also facilitates the incorporation of top-down effects: the measure of overall saliency in visual search, which combines bottom-up saliency with top-down knowledge of the target's appearance, emerges from our model as the pointwise mutual information between the observed visual features and the presence of a target (the key expressions are sketched after this list).

(2) Based on this theory, we implemented bottom-up saliency algorithms for both static images and dynamic scenes. In our model, saliency is computed locally, which is consistent with the neuroanatomy of the early visual system and yields an efficient algorithm with few free parameters (a minimal code sketch follows below). The algorithms demonstrate good performance at predicting human fixations during free viewing of images and videos. A real-time version of the dynamic saliency algorithm is implemented on a robotic camera; when the camera is oriented toward salient regions, the chance of seeing people is greatly improved.

(3) Our saliency framework accounts for feature search, conjunction search, and many search asymmetries straightforwardly. We further examine how attention is directed given a saliency map. We treat this as a multi-armed bandit decision-making problem and propose that attention is directed probabilistically with a probability-matching strategy (illustrated in the final sketch below). We also treat the visual search task as a sequential decision-making problem when investigating when subjects terminate a trial.
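To make the quantities in (1) concrete, the following is a sketch in notation of our own choosing (the symbols F, C, and f are introduced here for illustration, not taken from the dissertation): let F denote the visual features at a location and C = 1 the event that a target is present there. Then

\[
s_{\text{bottom-up}}(f) = -\log p(F = f),
\qquad
s_{\text{overall}}(f) = \log \frac{p(F = f,\, C = 1)}{p(F = f)\, p(C = 1)}
= \log p(F = f \mid C = 1) - \log p(F = f),
\]

so overall saliency decomposes into a top-down log-likelihood term, encoding knowledge of the target's appearance, plus the bottom-up self-information term, with p(F) estimated from natural statistics rather than from the image currently being viewed.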
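The local, few-parameter character of the bottom-up algorithm in (2) can be illustrated as follows. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the function name, the generalized-Gaussian feature model (a common fit to natural filter-response statistics) with hypothetical parameters theta and sigma, and the difference-of-boxes filter are all ours.

```python
import numpy as np
from scipy.signal import convolve2d

def bottom_up_saliency(image, filters, theta=2.0, sigma=1.0):
    """Self-information saliency: s(x) = -log p(f(x)).

    theta and sigma are generalized-Gaussian shape/scale parameters
    assumed to have been fit offline to natural image statistics,
    NOT to the statistics of the current image.
    """
    saliency = np.zeros_like(image, dtype=float)
    for filt in filters:
        # Local feature: linear filter response at each location.
        response = convolve2d(image, filt, mode="same", boundary="symm")
        # -log of a generalized-Gaussian density, dropping the
        # normalizing constant (it only shifts saliency uniformly).
        saliency += np.abs(response / sigma) ** theta
    return saliency

# Usage: one center-surround (difference-of-boxes) feature.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
center = np.zeros((9, 9)); center[3:6, 3:6] = 1 / 9.0
surround = np.full((9, 9), 1 / 81.0)
smap = bottom_up_saliency(img, [center - surround])
print(smap.shape, smap.max())
```

Because each saliency value depends only on filter responses in a local neighborhood and on parameters fixed in advance of viewing, the computation is local in the sense described above.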
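The probability-matching strategy proposed in (3) can be sketched in a few lines. This is our illustration under assumptions, not the dissertation's code; the function name and the random stand-in saliency map are hypothetical.

```python
import numpy as np

def next_fixation(saliency_map, rng):
    """Probability matching: sample a location with probability
    proportional to its saliency, rather than taking the argmax."""
    s = np.clip(saliency_map, 0.0, None).ravel()
    idx = rng.choice(s.size, p=s / s.sum())
    return np.unravel_index(idx, saliency_map.shape)

rng = np.random.default_rng(0)
smap = rng.random((32, 32))      # stand-in saliency map
print(next_fixation(smap, rng))  # (row, col) of the next fixation
```

Under probability matching, more salient locations are fixated more often in proportion to their saliency, making fixation choice stochastic rather than a deterministic winner-take-all selection.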
Taken together, these models account for many observations of errors and response times in visual search tasks. Together, these contributions work toward a unified statistical model of visual attention that not only accounts for human behavior but also admits practical implementation on complex images and videos.