Shrinking Temporal Attention in Transformers for Video Action Recognition

Authors: Bonan Li, Pengfei Xiong, Congying Han, Tiande Guo (pp. 1263-1271)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We conduct thorough ablation studies, and achieve state-of-the-art results on multiple action recognition benchmarks including Kinetics-400 and Something-Something V2, outperforming prior methods with 50% fewer FLOPs and without any pretrained model. |
| Researcher Affiliation | Collaboration | Bonan Li (1*), Pengfei Xiong (2*), Congying Han (1), Tiande Guo (1); 1: University of Chinese Academy of Sciences, 2: PCG, Tencent |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology. |
| Open Datasets | Yes | The proposed model is trained and evaluated on two public large-scale action recognition datasets, Kinetics-400 and Something-Something V2 (SSv2). |
| Dataset Splits | Yes | For both datasets, the methods are learned on the training set and evaluated on the validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory) used for its experiments. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify any software names with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | For the temporal domain, we randomly sample a frame from each segment to obtain one input sequence with T = 16, T = 32, or T = 64 frames. Meanwhile, we fix the short side of these frames to 256 and perform data augmentation following MViT (Fan et al. 2021) to obtain inputs of size 224 × 224 for the spatial domain. Specific implementation details can be found in the appendix. See the preprocessing sketch below the table. |
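Since the paper provides no code, the following is a minimal sketch of the sampling and preprocessing described in the Experiment Setup row, assuming a PyTorch/torchvision pipeline. The function name `sample_segment_indices` and the transform composition are illustrative assumptions, not taken from the paper, which follows the richer MViT augmentation recipe.

```python
import random

import torchvision.transforms as TV  # assumed dependency, not named in the paper


def sample_segment_indices(num_frames: int, T: int = 16) -> list[int]:
    """TSN-style temporal sampling as described in the setup: split the clip
    into T equal segments and randomly pick one frame index from each segment
    (T = 16, 32, or 64 in the paper). `num_frames` is the total number of
    decoded frames in the video."""
    seg_len = num_frames / T
    return [
        min(int(i * seg_len + random.uniform(0, seg_len)), num_frames - 1)
        for i in range(T)
    ]


# Spatial side: resize the short edge to 256, then crop to 224 x 224.
# Deliberately minimal stand-in for the MViT augmentation pipeline.
spatial_transform = TV.Compose([
    TV.Resize(256),       # short side -> 256
    TV.RandomCrop(224),   # 224 x 224 spatial input
    TV.ToTensor(),
])
```

For a 300-frame clip, `sample_segment_indices(300, T=16)` would return 16 roughly evenly spaced, jittered frame indices; the selected frames would then pass through `spatial_transform` before being stacked into a T × 3 × 224 × 224 input tensor.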