The development of automated methods capable of detecting and localizing actions is crucial for a variety of applications, ranging from surveillance and autonomous driving to content moderation. This thesis focuses on creating action detection methods that deliver robust performance. Their robustness rests on two fundamental elements: the detection of atomic actions and the capacity for compositional understanding.
Atomic actions are those that are identifiable from a single image or a short video. In this research, we developed novel methods that detect and localize such actions with state-of-the-art performance. The key strength of these methods lies in their ability to refine visual features both spatially and semantically, enabling precise identification of action-specific regions. For scalability, we further developed a multi-branch network to recognize new compositions of objects and actions. Our design ensures that each branch learns decoupled features, allowing the network to transfer previously learned concepts to identify new compositions. As our extensive experiments on benchmark datasets demonstrate, this approach outperforms existing methods by a clear margin. Further, correctly identifying the attributes of the objects participating in an action helps to detect unknown compositions. We therefore created a network that uses spatially localized learning to correctly associate objects with their attributes. This network achieves state-of-the-art performance in object-attribute association in cluttered scenes.
The methods developed in this thesis enable robust action detection at scale and can serve as a foundation for numerous future applications.