Alignment-guided Temporal Attention for Video Action Recognition

Authors: Yizhou Zhao, Zhenyang Li, Xun Guo, Yan Lu

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module. ... 4 Experiments; 4.1 Experimental Setting; Benchmarks. We employ two widely used video action recognition datasets, i.e., Kinetics-400 (K400) [22] and Something-Something V2 (SSv2) [14], in our experiments. ... 4.2 Comparison Results ... 4.3 Ablation study
Researcher Affiliation | Collaboration | Yizhou Zhao (1), Zhenyang Li (2), Xun Guo (3), Yan Lu (3); (1) Carnegie Mellon University, (2) Tsinghua University, (3) Microsoft Research Asia
Pseudocode | No | The paper describes its methods using mathematical equations and diagrams, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | We employ two widely used video action recognition datasets, i.e., Kinetics-400 (K400) [22] and Something-Something V2 (SSv2) [14], in our experiments.
Dataset Splits | Yes | Kinetics-400 contains 240k training videos and 30k validation videos in 400 classes of human actions. Something-Something V2 consists of 168.9K training videos and 24.7K validation videos for 174 classes. We provide the top-1 and top-5 accuracy on the validation sets, the inference complexity measured with FLOPs, and the model capacity in terms of the number of parameters.
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for its experiments. The self-evaluation checklist also confirms: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'
Software Dependencies | No | The paper mentions models and optimizers (e.g., 'TimeSformer', 'SGD') but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Implementation details. We use TimeSformer [3] with the officially released model pretrained on ImageNet-21K [23] as our baseline. ... We adopt SGD to optimize our network for 30 epochs with a mini-batch size of 64. The initial learning rate is set to 0.005 with 0.1 decays on the 21st and 27th epochs. All patch embeddings are applied with a weight decay of 1e-4, while the class tokens and the positional embeddings used no weight decay. ... The resolution of 224×224 is used throughout all the experiments.
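
The quoted training recipe maps onto a standard fine-tuning configuration. Below is a minimal sketch, assuming a PyTorch TimeSformer-style model; the parameter-name patterns used to exempt class tokens and positional embeddings from weight decay ("cls_token", "pos_embed") and the SGD momentum value are illustrative assumptions, not details given in the paper.

```python
import torch

def build_optimizer_and_schedule(model: torch.nn.Module):
    """Sketch of the reported recipe: SGD, 30 epochs, lr 0.005 with 0.1
    decays at epochs 21 and 27, weight decay 1e-4 except for class tokens
    and positional embeddings."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumption: class tokens / positional embeddings are named
        # "cls_token" / "pos_embed"; the paper does not give exact names.
        if "cls_token" in name or "pos_embed" in name:
            no_decay.append(param)
        else:
            decay.append(param)

    optimizer = torch.optim.SGD(
        [
            {"params": decay, "weight_decay": 1e-4},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=0.005,       # initial learning rate from the paper
        momentum=0.9,   # assumption: momentum is not stated in the paper
    )
    # Learning rate decays by 0.1 at the 21st and 27th of 30 epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[21, 27], gamma=0.1
    )
    return optimizer, scheduler
```

In this reading, the epoch-indexed decays are expressed with MultiStepLR and the two-group split reproduces the "no weight decay on class tokens and positional embeddings" rule; mini-batch size 64 and 224×224 input resolution would be set in the data loader rather than here.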