Shrinking Temporal Attention in Transformers for Video Action Recognition
Authors: Bonan Li, Pengfei Xiong, Congying Han, Tiande Guo
AAAI 2022, pp. 1263-1271
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct thorough ablation studies, and achieve state-of-the-art results on multiple action recognition benchmarks including Kinetics400 and Something-Something v2, outperforming prior methods with 50% less FLOPs and without any pretrained model. |
| Researcher Affiliation | Collaboration | Bonan Li (1*), Pengfei Xiong (2*), Congying Han (1), Tiande Guo (1); 1: University of Chinese Academy of Sciences; 2: PCG, Tencent |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | The proposed model is trained and evaluated on the two public large-scale action recognition datasets, Kinetics 400 and Something-Something V2 (SSv2). |
| Dataset Splits | Yes | For both of these two datasets, the methods are learned on the training set and evaluated on the validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" but does not specify any software names with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | For the temporal domain, we randomly sample a frame from each segment to obtain one input sequence with T = 16, T = 32 or T = 64 frames. Meanwhile, we fix the short side of these frames to 256 and perform data augmentation following MViT (Fan et al. 2021) to obtain data of size 224 × 224 for the spatial domain. Specific implementation details can be found in the appendix. |
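
The experiment setup quoted above amounts to segment-based temporal sampling followed by a short-side resize and a 224 × 224 crop. The sketch below illustrates one way such a pipeline could look in PyTorch/torchvision; the function `sample_frame_indices`, the use of `RandomCrop`, and all parameter names are illustrative assumptions, not the authors' exact implementation, since the paper follows MViT's augmentation recipe and defers specifics to its appendix.

```python
# Minimal sketch of the described clip construction, under the assumptions above.
import torch
from torchvision import transforms


def sample_frame_indices(num_frames: int, num_segments: int) -> list:
    """Pick one random frame index from each of `num_segments` equal segments."""
    boundaries = torch.linspace(0, num_frames, num_segments + 1).long()
    indices = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        lo, hi = int(start), max(int(end), int(start) + 1)  # guard against empty segments
        indices.append(int(torch.randint(lo, hi, (1,))))
    return indices


# Spatial preprocessing (assumed): resize the short side to 256, then take a
# 224 x 224 crop. RandomCrop is a simplified stand-in for MViT's full recipe.
spatial_transform = transforms.Compose([
    transforms.Resize(256),      # short side -> 256
    transforms.RandomCrop(224),  # 224 x 224 input for the spatial domain
])
```

For example, `sample_frame_indices(300, 16)` returns 16 indices for a 300-frame video, one drawn uniformly at random from each of 16 equal-length segments, matching the T = 16 configuration quoted in the table.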