CAST: Cross-Attention in Space and Time for Video Action Recognition
Authors: Dongho Lee, Jongseo Lee, Jinwoo Choi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. |
| Researcher Affiliation | Academia | Kyung Hee University, Republic of Korea |
| Pseudocode | No | The paper does not contain a dedicated pseudocode section or a clearly labeled algorithm block; the architecture and its operations are described in text and diagrams only. A hedged sketch of the cross-attention idea appears below the table. |
| Open Source Code | Yes | The code is available at https://github.com/KHU-VLL/CAST. |
| Open Datasets | Yes | Action recognition. We evaluate CAST on two public datasets for conventional action recognition: Something-Something-V2 (SSV2) [19] and Kinetics-400 (K400) [24]. Fine-grained action recognition. We evaluate CAST on the fine-grained action recognition task: EPIC-KITCHENS-100 (EK100) [10]. |
| Dataset Splits | Yes | SSV2 is split into train/val/test sets of 168K/24K/27K videos and has 174 human-object interaction categories. ... EK100 ... is split into train/val/test sets of 67K/10K/13K. |
| Hardware Specification | Yes | We conduct all the experiments with 16 NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch, building upon the existing VideoMAE [56] codebase, and utilizing the DeepSpeed library. However, it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | We sample 16 frames from each video to construct an input clip. ... We then perform random cropping and resize every frame to 224×224 pixels. We use the AdamW [39] optimizer with momentum betas of (0.9, 0.999) [7] and a weight decay of 0.05. By default, we train the model for 50 epochs, with cosine annealing learning rate scheduling [38] and a warm-up period of 5 epochs. The default base learning rate, layer decay [2], and drop path are set to 0.001, 0.8, and 0.2, respectively. ... We set the batch size per GPU to 6 with an update frequency of 2. A hedged sketch of this optimization recipe appears below the table. |
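
Since the paper provides no pseudocode, the following is a minimal PyTorch sketch of the core idea it describes in text and diagrams: a spatial expert and a temporal expert exchange information through bidirectional cross-attention with residual connections. Every name here (`BidirectionalCrossAttention`, `t2s`, `s2t`, the token shapes) is an illustrative assumption, not the authors' implementation; see the official repository linked above for the real code.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Hedged sketch of cross-attention between two expert streams.

    Each expert's tokens attend to the other expert's tokens, and the
    result is added back through a residual connection. Shapes and
    names are assumptions for illustration only.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Temporal tokens query spatial tokens, and vice versa.
        self.t2s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.s2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, spatial_tokens, temporal_tokens):
        # Both inputs: (batch, num_tokens, dim).
        s, t = self.norm_s(spatial_tokens), self.norm_t(temporal_tokens)
        t_out, _ = self.t2s(query=t, key=s, value=s)  # temporal attends to spatial
        s_out, _ = self.s2t(query=s, key=t, value=t)  # spatial attends to temporal
        return spatial_tokens + s_out, temporal_tokens + t_out

# Toy usage with made-up token counts (e.g., per-frame patch tokens vs.
# spatio-temporal tube tokens).
xattn = BidirectionalCrossAttention(dim=768)
spatial = torch.randn(2, 196, 768)
temporal = torch.randn(2, 1568, 768)
s_out, t_out = xattn(spatial, temporal)
```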
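
The experiment-setup row translates directly into an optimizer and schedule. Below is a hedged PyTorch sketch of that recipe (AdamW with betas (0.9, 0.999) and weight decay 0.05, base learning rate 0.001, 5 warm-up epochs out of 50 followed by cosine annealing). The stand-in model and the `steps_per_epoch` value are assumptions, and the paper's layer-wise learning-rate decay (0.8) and drop path (0.2) are deliberately omitted for brevity. Note that with 16 GPUs, a per-GPU batch size of 6, and an update frequency of 2, the effective batch size works out to 16 × 6 × 2 = 192.

```python
import math
import torch
from torch import nn

model = nn.Linear(768, 174)  # hypothetical stand-in for the actual CAST model

epochs, warmup_epochs = 50, 5
steps_per_epoch = 100  # assumption; depends on dataset size and batch size
base_lr = 1e-3

optimizer = torch.optim.AdamW(
    model.parameters(), lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05
)

def lr_lambda(step: int) -> float:
    """Linear warm-up for the first 5 epochs, then cosine annealing to zero."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per optimization step: optimizer.step(); scheduler.step()
```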