Temporal language grounding (TLG) aims to localize query-related events in videos, which requires reasoning about the relationships between video content and language descriptions. According to the selective visual attention mechanism in cognitive science, people's cognition and understanding of what happens often rely on the dynamic foreground information in a video. Nonetheless, the background usually predominates the scene, so query-related visual features are easily confused with irrelevant ones. Thus, we propose a Foreground Enhanced Network (FEN) to diminish the background effect from two aspects.
FEN first explicitly models the evolving foreground in video features along the spatial dimension by removing relatively unchanged background content. In addition, we propose a progressive contrastive sample generation module that gradually learns the differences between the predicted proposal and its elongated proposals, which contain the former as a portion, thereby distinguishing similar neighboring frames. Experiments on two commonly used datasets demonstrate the efficacy of our model.
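To make the two ideas above concrete, the following PyTorch-style sketch illustrates (i) suppressing relatively static background by subtracting a temporally pooled estimate from per-frame features and (ii) generating progressively elongated proposals that contain a predicted segment, to be used as hard contrastive samples. The tensor shapes, expansion ratio, and helper names are assumptions for illustration, not the authors' exact implementation.

```python
import torch


def enhance_foreground(frame_feats):
    """Suppress relatively unchanged content in per-frame features.

    frame_feats: (T, D) visual features for T frames.
    The temporal mean acts as a crude estimate of the static background
    (an assumed design choice); subtracting it emphasizes the evolving
    foreground.
    """
    background = frame_feats.mean(dim=0, keepdim=True)  # (1, D)
    return frame_feats - background                      # (T, D)


def elongated_proposals(start, end, num_frames, steps=3, ratio=0.25):
    """Generate progressively longer proposals that include (start, end).

    Each step pads the predicted segment by `ratio` of its length on both
    sides, clipped to the video boundaries. These proposals share most
    frames with the prediction and thus serve as hard contrastive samples
    for distinguishing similar neighboring frames.
    """
    length = end - start
    proposals = []
    for k in range(1, steps + 1):
        pad = int(round(k * ratio * length))
        proposals.append((max(0, start - pad), min(num_frames, end + pad)))
    return proposals


if __name__ == "__main__":
    feats = torch.randn(64, 512)             # 64 frames, 512-d features
    fg_feats = enhance_foreground(feats)
    print(fg_feats.shape)                     # torch.Size([64, 512])
    print(elongated_proposals(20, 36, 64))    # [(16, 40), (12, 44), (8, 48)]
```

In such a contrastive setup, features pooled over the predicted proposal would be pulled toward the query representation while features pooled over each elongated proposal are pushed away, encouraging the model to separate the event boundary from its visually similar neighborhood.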