Visual Attention

Video data typically contains a great deal of redundant information. In continuous scenes, large portions of consecutive frames often remain virtually the same, so there is no need to re-analyze those unchanged parts. Visual attention, as its name suggests, is an algorithm that selects areas of interest in a particular frame and directs the “attention” of our face detector to just those areas. The goal of visual attention, then, is to minimize the amount of computation performed by the face detector while maintaining the overall accuracy of its results.
Visual attention accomplishes this by analyzing only those regions of a video frame that have changed substantially from the previous frame. In regions of change (caused, for example, by motion), we re-analyze the frame and update the detection results. In regions that have not changed significantly, we carry the previous detection results forward. For many types of video, we have found that this can improve computational performance dramatically (by as much as a factor of 10) with virtually no loss in accuracy. Even for scenes with camera motion, we are able to achieve substantial computational savings.
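The per-frame step described above can be sketched as follows. The text does not say how regions of change are found, so this sketch assumes one simple approach: the frame is divided into fixed square tiles, and a tile counts as changed when its mean absolute pixel difference from the previous frame exceeds a threshold. The names `changed_tiles` and `update_detections`, the tile size, and the threshold are all hypothetical. Frames are plain lists of grayscale intensities so the example has no dependencies.

```python
from typing import Callable, Dict, List, Set, Tuple

Frame = List[List[int]]   # grayscale frame: rows of pixel intensities (0-255)
Tile = Tuple[int, int]    # (tile_row, tile_col) index into a grid of square tiles


def changed_tiles(prev: Frame, curr: Frame, tile: int = 16,
                  threshold: float = 10.0) -> Set[Tile]:
    """Return the tiles whose mean absolute pixel difference between the
    previous and current frame exceeds `threshold` (hypothetical criterion)."""
    height, width = len(curr), len(curr[0])
    changed: Set[Tile] = set()
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            total, count = 0, 0
            # Edge tiles may be smaller than tile x tile; `count` accounts for that.
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
            if total / count > threshold:
                changed.add((ty // tile, tx // tile))
    return changed


def update_detections(previous: Dict[Tile, list], changed: Set[Tile],
                      detect: Callable[[Tile], list]) -> Dict[Tile, list]:
    """Re-run `detect` only on changed tiles; carry every other tile's
    previous results forward unchanged."""
    updated = dict(previous)          # copy so the caller's results are not mutated
    for key in changed:
        updated[key] = detect(key)    # fresh detections for regions of change
    return updated
```

For example, if only the lower-left 16x16 block of a 32x32 frame changes, `changed_tiles` returns `{(1, 0)}` and the detector runs on one tile out of four. Keying results by tile is a simplification: a face straddling a tile boundary would require its neighbouring tiles to be re-analyzed as well, which a practical implementation would need to handle.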
The video below demonstrates visual attention. The bright portions illustrate regions of change, while the dark portions indicate regions that were not re-analyzed by our face detector. Periodically, visual attention re-analyzes the entire video frame to prevent small errors from accumulating over time. For clarity, the frame rate of the video has been intentionally reduced.
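The periodic full-frame re-analysis can be expressed as a simple schedule. In this sketch, which assumes a fixed interval (the text gives neither the interval nor the mechanism), every `refresh_interval`-th frame, including the first, analyzes all tiles, while other frames analyze only the tiles that changed. `analyzed_fraction` reports the share of tile analyses actually performed, relative to re-analyzing every tile of every frame. Both function names and the default interval of 30 frames are hypothetical.

```python
from typing import List, Set, Tuple

Tile = Tuple[int, int]  # (tile_row, tile_col) index into a grid of tiles


def tiles_to_analyze(frame_index: int, changed: Set[Tile], grid_rows: int,
                     grid_cols: int, refresh_interval: int = 30) -> Set[Tile]:
    """Analyze every tile on refresh frames (so small errors cannot accumulate);
    otherwise analyze only the tiles that changed."""
    if frame_index % refresh_interval == 0:
        return {(r, c) for r in range(grid_rows) for c in range(grid_cols)}
    return set(changed)


def analyzed_fraction(changed_per_frame: List[Set[Tile]], grid_rows: int,
                      grid_cols: int, refresh_interval: int = 30) -> float:
    """Fraction of tile analyses performed compared with analyzing every tile
    of every frame (1.0 means no savings)."""
    if not changed_per_frame:
        return 0.0
    total = grid_rows * grid_cols * len(changed_per_frame)
    done = sum(len(tiles_to_analyze(i, changed, grid_rows, grid_cols, refresh_interval))
               for i, changed in enumerate(changed_per_frame))
    return done / total
```

A shorter interval bounds how long a stale detection can persist, at the cost of more full-frame passes; a longer one saves more computation but lets errors linger longer.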