Is Space-Time Attention All You Need for Video Understanding?
Authors: Gedas Bertasius, Heng Wang, Lorenzo Torresani
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental study compares different self-attention schemes and suggests that divided attention, where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. We evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018). (A minimal sketch of the divided-attention block is given after the table.) |
| Researcher Affiliation | Collaboration | Facebook AI and Dartmouth College. |
| Pseudocode | No | The paper does not contain a section explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured algorithmic steps. |
| Open Source Code | Yes | Code and models are available at: https://github.com/facebookresearch/TimeSformer. |
| Open Datasets | Yes | We evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018). We adopt the base ViT architecture (Dosovitskiy et al., 2020) pretrained on either ImageNet-1K or ImageNet-21K (Deng et al., 2009). Lastly, we evaluate TimeSformer on the task of long-term video modeling using HowTo100M (Miech et al., 2019). |
| Dataset Splits | Yes | We evaluate the models on the validation sets of Kinetics-400 (K400) and Something-Something-V2 (SSv2). We randomly partition this collection into 85K training videos and 35K testing videos. |
| Hardware Specification | Yes | We compare the video training time on Kinetics-400 (in Tesla V100 GPU hours) of TimeSformer to that of SlowFast and I3D. ... We also measured the actual inference runtime on 20K validation videos of Kinetics-400 (using 8 Tesla V100 GPUs). |
| Software Dependencies | No | The paper does not specify version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | Unless differently indicated, we use clips of size 8×224×224, with frames sampled at a rate of 1/32. The patch size is 16×16 pixels. During inference, unless otherwise noted, we sample a single temporal clip in the middle of the video. We use 3 spatial crops (top-left, center, bottom-right) from the temporal clip and obtain the final prediction by averaging the scores for these 3 crops. (An inference sketch follows the table.) |
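
To make the "divided attention" design referenced in the Research Type row concrete, here is a minimal PyTorch sketch of one divided space-time attention block: temporal self-attention across frames, then spatial self-attention within each frame, each with its own residual connection, followed by the usual transformer MLP. The class name `DividedAttentionBlock`, the use of `nn.MultiheadAttention`, and the omission of the classification token are our simplifications for illustration, not the authors' released implementation (see their repository for the exact code).

```python
# Hedged sketch of a divided space-time attention block (not the authors' code).
import torch
import torch.nn as nn


class DividedAttentionBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, T, S):
        # x: (B, T*S, D) patch tokens; class token omitted for brevity.
        B, _, D = x.shape

        # Temporal attention: each spatial location attends across the T frames.
        xt = x.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt_n = self.norm_t(xt)
        xt = xt + self.attn_t(xt_n, xt_n, xt_n, need_weights=False)[0]
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3).reshape(B, T * S, D)

        # Spatial attention: each frame's S patches attend to one another.
        xs = x.reshape(B * T, S, D)
        xs_n = self.norm_s(xs)
        xs = xs + self.attn_s(xs_n, xs_n, xs_n, need_weights=False)[0]
        x = xs.reshape(B, T * S, D)

        # Standard transformer MLP with residual connection.
        return x + self.mlp(self.norm_mlp(x))


# Example: 8 frames of 224x224 pixels with 16x16 patches -> S = 196 tokens/frame.
tokens = torch.randn(2, 8 * 196, 768)
out = DividedAttentionBlock()(tokens, T=8, S=196)
print(out.shape)  # torch.Size([2, 1568, 768])
```

Compared with joint space-time attention over all T*S tokens, this factorization attends over only T tokens and then S tokens per query, which is what makes the design tractable for longer clips.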
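
The Experiment Setup row describes single-clip, three-crop inference with score averaging. The sketch below illustrates that protocol under stated assumptions: `model` is any video classifier returning class scores for a batched clip, the video tensor has already been resized so its shorter spatial side is at least the crop size, and the helper name `three_crop_inference` is hypothetical.

```python
# Hedged sketch of the inference protocol: one middle temporal clip,
# three spatial crops (top-left, center, bottom-right), averaged scores.
import torch


def three_crop_inference(model, video, num_frames=8, stride=32, crop=224):
    # video: (C, F, H, W) tensor with H, W >= crop.
    C, F, H, W = video.shape

    # Single temporal clip of num_frames frames at a sampling rate of
    # 1/stride, centered in the video; indices clamped at the last frame.
    span = num_frames * stride
    start = max((F - span) // 2, 0)
    idx = torch.clamp(start + stride * torch.arange(num_frames), max=F - 1)
    clip = video[:, idx]  # (C, num_frames, H, W)

    # Three spatial crops: top-left, center, bottom-right.
    crops = [
        clip[:, :, :crop, :crop],
        clip[:, :, (H - crop) // 2:(H - crop) // 2 + crop,
                   (W - crop) // 2:(W - crop) // 2 + crop],
        clip[:, :, H - crop:, W - crop:],
    ]

    # Final prediction: average the class scores over the three crops.
    with torch.no_grad():
        scores = torch.stack([model(c.unsqueeze(0)) for c in crops]).mean(0)
    return scores
```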