Is Space-Time Attention All You Need for Video Understanding?
Authors: Gedas Bertasius, Heng Wang, Lorenzo Torresani
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental study compares different self-attention schemes and suggests that divided attention, where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. We evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018). (A minimal sketch of the divided-attention block is given after the table.) |
| Researcher Affiliation | Collaboration | Facebook AI and Dartmouth College. |
| Pseudocode | No | The paper does not contain a section explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured algorithmic steps. |
| Open Source Code | Yes | Code and models are available at: https://github.com/facebookresearch/TimeSformer. |
| Open Datasets | Yes | We evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018). We adopt the base ViT architecture (Dosovitskiy et al., 2020) pretrained on either ImageNet-1K or ImageNet-21K (Deng et al., 2009). Lastly, we evaluate TimeSformer on the task of long-term video modeling using HowTo100M (Miech et al., 2019). |
| Dataset Splits | Yes | We evaluate the models on the validation sets of Kinetics-400 (K400) and Something-Something-V2 (SSv2). We randomly partition this collection into 85K training videos and 35K testing videos. |
| Hardware Specification | Yes | We compare the video training time on Kinetics-400 (in Tesla V100 GPU hours) of TimeSformer to that of SlowFast and I3D. ... We also measured the actual inference runtime on 20K validation videos of Kinetics-400 (using 8 Tesla V100 GPUs). |
| Software Dependencies | No | The paper does not specify version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | Unless differently indicated, we use clips of size 8×224×224, with frames sampled at a rate of 1/32. The patch size is 16×16 pixels. During inference, unless otherwise noted, we sample a single temporal clip in the middle of the video. We use 3 spatial crops (top-left, center, bottom-right) from the temporal clip and obtain the final prediction by averaging the scores for these 3 crops. (An inference sketch follows the table.) |
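
To make the "divided attention" design referenced in the Research Type row concrete, here is a minimal PyTorch sketch of one divided space-time attention block: temporal self-attention across frames, then spatial self-attention within each frame, each with its own residual connection, followed by the usual transformer MLP. The class name `DividedAttentionBlock`, the use of `nn.MultiheadAttention`, and the omission of the classification token are our simplifications for illustration, not the authors' released implementation (see their repository for the exact code).

```python
# Hedged sketch of a divided space-time attention block (not the authors' code).
import torch
import torch.nn as nn


class DividedAttentionBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, T, S):
        # x: (B, T*S, D) patch tokens; class token omitted for brevity.
        B, _, D = x.shape

        # Temporal attention: each spatial location attends across the T frames.
        xt = x.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt_n = self.norm_t(xt)
        xt = xt + self.attn_t(xt_n, xt_n, xt_n, need_weights=False)[0]
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3).reshape(B, T * S, D)

        # Spatial attention: each frame's S patches attend to one another.
        xs = x.reshape(B * T, S, D)
        xs_n = self.norm_s(xs)
        xs = xs + self.attn_s(xs_n, xs_n, xs_n, need_weights=False)[0]
        x = xs.reshape(B, T * S, D)

        # Standard transformer MLP with residual connection.
        return x + self.mlp(self.norm_mlp(x))


# Example: 8 frames of 224x224 pixels with 16x16 patches -> S = 196 tokens/frame.
tokens = torch.randn(2, 8 * 196, 768)
out = DividedAttentionBlock()(tokens, T=8, S=196)
print(out.shape)  # torch.Size([2, 1568, 768])
```

Compared with joint space-time attention over all T*S tokens, this factorization attends over only T tokens and then S tokens per query, which is what makes the design tractable for longer clips.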
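
The Experiment Setup row describes single-clip, three-crop inference with score averaging. The sketch below illustrates that protocol under stated assumptions: `model` is any video classifier returning class scores for a batched clip, the video tensor has already been resized so its shorter spatial side is at least the crop size, and the helper name `three_crop_inference` is hypothetical.

```python
# Hedged sketch of the inference protocol: one middle temporal clip,
# three spatial crops (top-left, center, bottom-right), averaged scores.
import torch


def three_crop_inference(model, video, num_frames=8, stride=32, crop=224):
    # video: (C, F, H, W) tensor with H, W >= crop.
    C, F, H, W = video.shape

    # Single temporal clip of num_frames frames at a sampling rate of
    # 1/stride, centered in the video; indices clamped at the last frame.
    span = num_frames * stride
    start = max((F - span) // 2, 0)
    idx = torch.clamp(start + stride * torch.arange(num_frames), max=F - 1)
    clip = video[:, idx]  # (C, num_frames, H, W)

    # Three spatial crops: top-left, center, bottom-right.
    crops = [
        clip[:, :, :crop, :crop],
        clip[:, :, (H - crop) // 2:(H - crop) // 2 + crop,
                   (W - crop) // 2:(W - crop) // 2 + crop],
        clip[:, :, H - crop:, W - crop:],
    ]

    # Final prediction: average the class scores over the three crops.
    with torch.no_grad():
        scores = torch.stack([model(c.unsqueeze(0)) for c in crops]).mean(0)
    return scores
```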