Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Authors: Xuefan Zha, Wentao Zhu, Tingxun Lv, Sen Yang, Ji Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct thorough ablation studies to validate each component and hyper-parameters in our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.
Researcher Affiliation | Industry | Xuefan Zha (Kuaishou Technology, zhaxuefan@kuaishou.com); Wentao Zhu (Kuaishou Technology, wentaozhu@kuaishou.com); Tingxun Lv (Kuaishou Technology, lvtingxun@kuaishou.com); Sen Yang (Kuaishou Technology, senyang@kuaishou.com); Ji Liu (Kuaishou Technology, ji.liu.uwisc@Gmail.com)
Pseudocode | No | The paper describes the model architecture and components using mathematical equations and block diagrams, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link releasing open-source code for the described methodology.
Open Datasets | Yes | We evaluate our shifted chunk Transformer, denoted as SCT, on five commonly used action recognition datasets: Kinetics-400 [20], Kinetics-600 [7], Moment-in-Time [31] (Appendix), UCF101 [39] and HMDB51 [25].
Dataset Splits | No | The paper evaluates on commonly used action recognition datasets (Kinetics-400, Kinetics-600, UCF101, HMDB51), which have predefined splits, but it does not explicitly state the train/validation/test split percentages or sample counts used in its experiments.
Hardware Specification | Yes | All the experiments are run on 8 NVIDIA Tesla V100 32 GB GPU cards.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | In the training, we use a synchronous stochastic gradient descent with momentum of 0.9, a cosine annealing schedule [30], and the number of epochs of 50. We use batch size of 32, 16 and 8 for SCT-S, SCT-M and SCT-L, respectively. The frame crop size is set to be 224 × 224. For data augmentation, we randomly select the start frame to generate the input clip. In the inference, we extract multiple views from each video and obtain the final prediction by averaging the softmax probabilistic scores from these multi-view predictions. The details of initial learning rate, optimization and data processing are shown in Table 2.
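
The quoted setup maps onto a standard training loop. Below is a minimal PyTorch sketch of that recipe, not the authors' implementation (which is not released): the synchronous SGD with momentum 0.9, the cosine annealing schedule over 50 epochs, the 224 × 224 crops, and the multi-view softmax averaging at inference come from the paper, while the toy model, the dummy data, and the learning rate of 0.1 are hypothetical placeholders (the paper defers exact learning rates and data processing to its Table 2).

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-in for the (unreleased) SCT backbone: average-pool each clip over
# (T, H, W) and apply a linear head over 400 Kinetics-400 classes.
model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 400))

# Dummy data in place of the Kinetics pipeline; the paper uses batch sizes of
# 32/16/8 for SCT-S/M/L and randomly selects the start frame of each clip.
train_loader = [(torch.randn(2, 3, 16, 224, 224), torch.randint(0, 400, (2,)))]
views = [torch.randn(1, 3, 16, 224, 224) for _ in range(4)]  # multi-view test clips

# Quoted recipe: synchronous SGD, momentum 0.9, cosine annealing, 50 epochs.
# lr=0.1 is a placeholder; the paper gives the initial learning rate in its Table 2.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(clips), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

# Inference: average softmax scores across the views of each video.
model.eval()
with torch.no_grad():
    probs = torch.stack([F.softmax(model(v), dim=-1) for v in views])
    prediction = probs.mean(dim=0).argmax(dim=-1)
```

Averaging softmax scores over several spatial and temporal views is the standard multi-view evaluation protocol for video classification; the sketch assumes the views have already been cropped, since the paper leaves those details to its Table 2.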