Shrinking Temporal Attention in Transformers for Video Action Recognition

Authors: Bonan Li, Pengfei Xiong, Congying Han, Tiande Guo (pp. 1263-1271)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We conduct thorough ablation studies, and achieve state-of-the-art results on multiple action recognition benchmarks including Kinetics-400 and Something-Something V2, outperforming prior methods with 50% fewer FLOPs and without any pretrained model. |
| Researcher Affiliation | Collaboration | Bonan Li (1*), Pengfei Xiong (2*), Congying Han (1), Tiande Guo (1); 1: University of Chinese Academy of Sciences, 2: PCG, Tencent |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology. |
| Open Datasets | Yes | The proposed model is trained and evaluated on two public large-scale action recognition datasets, Kinetics-400 and Something-Something V2 (SSv2). |
| Dataset Splits | Yes | For both datasets, the methods are learned on the training set and evaluated on the validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory) used for its experiments. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify any software names with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | For the temporal domain, we randomly sample a frame from each segment to obtain one input sequence with T = 16, T = 32, or T = 64 frames. Meanwhile, we fix the short side of these frames to 256 and perform data augmentation following MViT (Fan et al. 2021) to obtain inputs of size 224 × 224 for the spatial domain. Specific implementation details can be found in the appendix. See the preprocessing sketch below the table. |
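Since the paper provides no code, the following is a minimal sketch of the sampling and preprocessing described in the Experiment Setup row, assuming a PyTorch/torchvision pipeline. The function name `sample_segment_indices` and the transform composition are illustrative assumptions, not taken from the paper, which follows the richer MViT augmentation recipe.

```python
import random

import torchvision.transforms as TV  # assumed dependency, not named in the paper


def sample_segment_indices(num_frames: int, T: int = 16) -> list[int]:
    """TSN-style temporal sampling as described in the setup: split the clip
    into T equal segments and randomly pick one frame index from each segment
    (T = 16, 32, or 64 in the paper). `num_frames` is the total number of
    decoded frames in the video."""
    seg_len = num_frames / T
    return [
        min(int(i * seg_len + random.uniform(0, seg_len)), num_frames - 1)
        for i in range(T)
    ]


# Spatial side: resize the short edge to 256, then crop to 224 x 224.
# Deliberately minimal stand-in for the MViT augmentation pipeline.
spatial_transform = TV.Compose([
    TV.Resize(256),       # short side -> 256
    TV.RandomCrop(224),   # 224 x 224 spatial input
    TV.ToTensor(),
])
```

For a 300-frame clip, `sample_segment_indices(300, T=16)` would return 16 roughly evenly spaced, jittered frame indices; the selected frames would then pass through `spatial_transform` before being stacked into a T × 3 × 224 × 224 input tensor.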