Is Space-Time Attention All You Need for Video Understanding?

Authors: Gedas Bertasius, Heng Wang, Lorenzo Torresani

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental study compares different self-attention schemes and suggests that divided attention, where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. We evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018). (A minimal sketch of divided space-time attention follows this table.)
Researcher Affiliation | Collaboration | 1Facebook AI, 2Dartmouth College.
Pseudocode | No | The paper does not contain a section explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured algorithmic steps.
Open Source Code | Yes | Code and models are available at: https://github.com/facebookresearch/TimeSformer.
Open Datasets | Yes | We evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018). We adopt the Base ViT architecture (Dosovitskiy et al., 2020) pretrained on either ImageNet-1K or ImageNet-21K (Deng et al., 2009). Lastly, we evaluate TimeSformer on the task of long-term video modeling using HowTo100M (Miech et al., 2019).
Dataset Splits | Yes | We evaluate the models on the validation sets of Kinetics-400 (K400) and Something-Something-V2 (SSv2). We randomly partition this collection into 85K training videos and 35K testing videos. (An illustrative split sketch follows this table.)
Hardware Specification | Yes | We compare the video training time on Kinetics-400 (in Tesla V100 GPU hours) of TimeSformer to that of SlowFast and I3D. ... We also measured the actual inference runtime on 20K validation videos of Kinetics-400 (using 8 Tesla V100 GPUs).
Software Dependencies | No | The paper does not specify version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | Unless differently indicated, we use clips of size 8 × 224 × 224, with frames sampled at a rate of 1/32. The patch size is 16 × 16 pixels. During inference, unless otherwise noted, we sample a single temporal clip in the middle of the video. We use 3 spatial crops (top-left, center, bottom-right) from the temporal clip and obtain the final prediction by averaging the scores for these 3 crops. (A sketch of this clip-sampling and 3-crop inference procedure follows this table.)
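
The divided-attention design quoted in the Research Type row applies temporal self-attention and spatial self-attention separately within each transformer block. Below is a minimal PyTorch-style sketch of that idea, not the authors' implementation (see the linked facebookresearch/TimeSformer repository for that); the class name `DividedSpaceTimeBlock`, the frame-major token ordering, and the omission of the classification token are simplifying assumptions.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Illustrative divided space-time attention: temporal, then spatial, then MLP."""

    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, T, S):
        # x: (B, T*S, dim) patch tokens, ordered frame-major (all S patches of
        # frame 0, then frame 1, ...). The classification token is omitted.
        B, _, D = x.shape
        # Temporal attention: each spatial location attends over the T frames.
        xt = self.norm_t(x).reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt, _ = self.attn_t(xt, xt, xt, need_weights=False)
        x = x + xt.reshape(B, S, T, D).permute(0, 2, 1, 3).reshape(B, T * S, D)
        # Spatial attention: each frame attends over its S patches.
        xs = self.norm_s(x).reshape(B * T, S, D)
        xs, _ = self.attn_s(xs, xs, xs, need_weights=False)
        x = x + xs.reshape(B, T * S, D)
        # Standard transformer MLP with residual connection.
        return x + self.mlp(self.norm_mlp(x))
```

For example, with the clip configuration quoted in the Experiment Setup row (8 frames of 224 × 224 with 16 × 16 patches), each clip yields T = 8 and S = 196 patch tokens per frame.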
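
The Dataset Splits row mentions a random partition into 85K training and 35K testing videos (for the long-term HowTo100M experiments). The snippet below is only an illustrative, reproducible way to produce such a partition; the function name, the fixed seed, and the use of video IDs are assumptions, and the authors' actual split is not reconstructed here.

```python
import random

def split_videos(video_ids, n_train=85_000, n_test=35_000, seed=0):
    # Shuffle a copy of the IDs deterministically, then slice off the two sets.
    assert len(video_ids) >= n_train + n_test
    rng = random.Random(seed)
    shuffled = list(video_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]
```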
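
The Experiment Setup row describes sampling a single 8-frame clip at a sampling rate of 1/32 (one frame every 32) from the middle of the video and averaging scores over three spatial crops. The following is a hedged sketch of that inference protocol, assuming the video's shorter side has already been resized to the crop size; `model`, `video`, and the exact resizing and normalization conventions are placeholders and may differ from the official evaluation code.

```python
import torch

def three_crop_inference(model, video, num_frames=8, stride=32, crop=224):
    # video: (C, T, H, W) float tensor whose shorter spatial side is >= `crop`.
    C, T, H, W = video.shape
    # One clip of `num_frames` frames at the given stride, centered in the video.
    span = num_frames * stride
    start = max(0, (T - span) // 2)
    idx = torch.clamp(torch.arange(num_frames) * stride + start, max=T - 1)
    clip = video[:, idx]  # (C, num_frames, H, W)
    # Three spatial crops: top-left, center, bottom-right.
    offsets = [(0, 0), ((H - crop) // 2, (W - crop) // 2), (H - crop, W - crop)]
    crops = [clip[:, :, y:y + crop, x:x + crop] for y, x in offsets]
    batch = torch.stack(crops)  # (3, C, num_frames, crop, crop)
    with torch.no_grad():
        scores = model(batch)   # assumed to return (3, num_classes) scores
    return scores.mean(dim=0)   # final prediction: average over the 3 crops
```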