Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Authors: Xuefan Zha, Wentao Zhu, Tingxun Lv, Sen Yang, Ji Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct thorough ablation studies to validate each component and hyper-parameters in our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.
Researcher Affiliation | Industry | Xuefan Zha (Kuaishou Technology, zhaxuefan@kuaishou.com); Wentao Zhu (Kuaishou Technology, wentaozhu@kuaishou.com); Tingxun Lv (Kuaishou Technology, lvtingxun@kuaishou.com); Sen Yang (Kuaishou Technology, senyang@kuaishou.com); Ji Liu (Kuaishou Technology, ji.liu.uwisc@Gmail.com)
Pseudocode | No | The paper describes the model architecture and components using mathematical equations and block diagrams, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link releasing open-source code for the described methodology.
Open Datasets | Yes | We evaluate our shifted chunk Transformer, denoted as SCT, on five commonly used action recognition datasets: Kinetics-400 [20], Kinetics-600 [7], Moment-in-Time [31] (Appendix), UCF101 [39] and HMDB51 [25].
Dataset Splits | No | The paper evaluates on commonly used action recognition datasets (Kinetics-400, Kinetics-600, UCF101, HMDB51), which have predefined splits, but it does not explicitly state the train/validation/test split percentages or sample counts used in its experiments.
Hardware Specification | Yes | All the experiments are run on 8 NVIDIA Tesla V100 32 GB GPU cards.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | In the training, we use a synchronous stochastic gradient descent with momentum of 0.9, a cosine annealing schedule [30], and the number of epochs of 50. We use batch size of 32, 16 and 8 for SCT-S, SCT-M and SCT-L, respectively. The frame crop size is set to be 224 × 224. For data augmentation, we randomly select the start frame to generate the input clip. In the inference, we extract multiple views from each video and obtain the final prediction by averaging the softmax probabilistic scores from these multi-view predictions. The details of initial learning rate, optimization and data processing are shown in Table 2.
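
The quoted setup maps onto a standard training loop. Below is a minimal PyTorch sketch of that recipe, not the authors' implementation (which is not released): the synchronous SGD with momentum 0.9, the cosine annealing schedule over 50 epochs, the 224 × 224 crops, and the multi-view softmax averaging at inference come from the paper, while the toy model, the dummy data, and the learning rate of 0.1 are hypothetical placeholders (the paper defers exact learning rates and data processing to its Table 2).

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-in for the (unreleased) SCT backbone: average-pool each clip over
# (T, H, W) and apply a linear head over 400 Kinetics-400 classes.
model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 400))

# Dummy data in place of the Kinetics pipeline; the paper uses batch sizes of
# 32/16/8 for SCT-S/M/L and randomly selects the start frame of each clip.
train_loader = [(torch.randn(2, 3, 16, 224, 224), torch.randint(0, 400, (2,)))]
views = [torch.randn(1, 3, 16, 224, 224) for _ in range(4)]  # multi-view test clips

# Quoted recipe: synchronous SGD, momentum 0.9, cosine annealing, 50 epochs.
# lr=0.1 is a placeholder; the paper gives the initial learning rate in its Table 2.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(clips), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

# Inference: average softmax scores across the views of each video.
model.eval()
with torch.no_grad():
    probs = torch.stack([F.softmax(model(v), dim=-1) for v in views])
    prediction = probs.mean(dim=0).argmax(dim=-1)
```

Averaging softmax scores over several spatial and temporal views is the standard multi-view evaluation protocol for video classification; the sketch assumes the views have already been cropped, since the paper leaves those details to its Table 2.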