CAST: Cross-Attention in Space and Time for Video Action Recognition

Authors: Dongho Lee, Jongseo Lee, Jinwoo Choi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400.
Researcher Affiliation | Academia | Kyung Hee University, Republic of Korea
Pseudocode | No | The paper does not contain a dedicated pseudocode section or a clearly labeled algorithm block. It describes the architecture and operations using text and diagrams, but not in a pseudocode format.
Open Source Code | Yes | The code is available at https://github.com/KHU-VLL/CAST.
Open Datasets | Yes | Action recognition. We evaluate the CAST on two public datasets for conventional action recognition: Something-Something-V2 (SSV2) [19] and Kinetics-400 (K400) [24]. Fine-grained action recognition. We evaluate the CAST on the fine-grained action recognition task: EPIC-KITCHENS-100 (EK100) [10].
Dataset Splits | Yes | The dataset is split into train/val/test sets of 168K/24K/27K and has 174 human-object interaction categories. ... EK100 ... is split into train/val/test sets of 67K/10K/13K.
Hardware Specification | Yes | We conduct all the experiments with 16 NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions using PyTorch, building upon the existing VideoMAE [56] codebase, and utilizing the DeepSpeed library. However, it does not specify version numbers for any of these software components.
Experiment Setup | Yes | We sample 16 frames from each video to construct an input clip. ... We then perform random cropping and resize every frame to 224 × 224 pixels. We use the AdamW [39] optimizer with momentum betas of (0.9, 0.999) [7] and a weight decay of 0.05. By default, we train the model for 50 epochs, with cosine annealing learning rate scheduling [38] and a warm-up period of 5 epochs. The default base learning rate, layer decay [2], and drop path are set to 0.001, 0.8, and 0.2, respectively. ... We set the batch size per GPU to 6 with an update frequency of 2.
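
The setup in the last row maps onto a fairly standard PyTorch training recipe. The sketch below shows one plausible way to wire the reported hyperparameters together (AdamW with betas (0.9, 0.999) and weight decay 0.05, a 5-epoch warm-up, cosine annealing over 50 epochs, per-GPU batch 6 with gradient accumulation of 2); the placeholder model, the warm-up lambda, and the effective-batch arithmetic are illustrative assumptions, not the authors' implementation, which additionally applies layer-wise learning-rate decay (0.8) and drop path (0.2) inside the backbone.

```python
# Minimal sketch of the reported training recipe; illustrative only, not the
# authors' code. `model` is a placeholder, and layer-wise LR decay (0.8) and
# drop path (0.2) are omitted because they depend on the backbone definition.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

model = torch.nn.Linear(768, 174)  # stand-in for the actual CAST model

epochs, warmup_epochs = 50, 5
base_lr, weight_decay = 1e-3, 0.05
per_gpu_batch, update_freq, num_gpus = 6, 2, 16
# Effective global batch size implied by the setup: 6 * 2 * 16 = 192.
effective_batch = per_gpu_batch * update_freq * num_gpus

# Per-frame preprocessing as described: random cropping and resizing to 224x224.
frame_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

optimizer = AdamW(model.parameters(), lr=base_lr,
                  betas=(0.9, 0.999), weight_decay=weight_decay)

def lr_lambda(epoch: int) -> float:
    # Linear warm-up for the first 5 epochs, then cosine annealing to zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)  # stepped once per epoch
```

In an actual training loop, gradients would be accumulated over `update_freq` mini-batches before each optimizer step, and `scheduler.step()` would be called once per epoch so the warm-up and cosine schedule follow the epoch counts above.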