Local information is crucial in many image and video analysis tasks. In this thesis,
we present four representative works that exploit local information. We first introduce a
set of per-pixel labeling datasets, which provide a platform for studying the use of local
information in image analysis. Based on these datasets, we propose a novel segmentation
method that exploits local appearance consistency for semantic part parsing of cars. We
then address the problem of attention in video action recognition by designing a latent
attention module that is learned jointly with the video recognition components. Finally, we
improve the attention mechanism to explicitly detect the spatial and spatio-temporal regions
of interest (ROIs) that are related to actions.