CAST: Cross-Attention in Space and Time for Video Action Recognition

Authors: Dongho Lee, Jongseo Lee, Jinwoo Choi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400.
Researcher Affiliation | Academia | Kyung Hee University, Republic of Korea
Pseudocode | No | The paper does not contain a dedicated pseudocode section or a clearly labeled algorithm block. It describes the architecture and operations using text and diagrams, but not in a pseudocode format.
Open Source Code | Yes | The code is available at https://github.com/KHU-VLL/CAST.
Open Datasets | Yes | Action recognition. We evaluate the CAST on two public datasets for conventional action recognition: Something-Something-V2 (SSV2) [19] and Kinetics-400 (K400) [24]. Fine-grained action recognition. We evaluate the CAST on the fine-grained action recognition task: EPIC-KITCHENS-100 (EK100) [10].
Dataset Splits | Yes | The dataset is split into train/val/test sets of 168K/24K/27K and has 174 human-object interaction categories. ... EK100 ... is split into train/val/test sets of 67K/10K/13K.
Hardware Specification | Yes | We conduct all the experiments with 16 NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions using PyTorch, building upon the existing VideoMAE [56] codebase, and utilizing the DeepSpeed library. However, it does not specify version numbers for any of these software components.
Experiment Setup | Yes | We sample 16 frames from each video to construct an input clip. ... We then perform random cropping and resize every frame to 224 × 224 pixels. We use the AdamW [39] optimizer with momentum betas of (0.9, 0.999) [7] and a weight decay of 0.05. By default, we train the model for 50 epochs, with cosine annealing learning rate scheduling [38] and a warm-up period of 5 epochs. The default base learning rate, layer decay [2], and drop path are set to 0.001, 0.8, and 0.2, respectively. ... We set the batch size per GPU to 6 with an update frequency of 2.
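
The setup in the last row maps onto a fairly standard PyTorch training recipe. The sketch below shows one plausible way to wire the reported hyperparameters together (AdamW with betas (0.9, 0.999) and weight decay 0.05, a 5-epoch warm-up, cosine annealing over 50 epochs, per-GPU batch 6 with gradient accumulation of 2); the placeholder model, the warm-up lambda, and the effective-batch arithmetic are illustrative assumptions, not the authors' implementation, which additionally applies layer-wise learning-rate decay (0.8) and drop path (0.2) inside the backbone.

```python
# Minimal sketch of the reported training recipe; illustrative only, not the
# authors' code. `model` is a placeholder, and layer-wise LR decay (0.8) and
# drop path (0.2) are omitted because they depend on the backbone definition.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

model = torch.nn.Linear(768, 174)  # stand-in for the actual CAST model

epochs, warmup_epochs = 50, 5
base_lr, weight_decay = 1e-3, 0.05
per_gpu_batch, update_freq, num_gpus = 6, 2, 16
# Effective global batch size implied by the setup: 6 * 2 * 16 = 192.
effective_batch = per_gpu_batch * update_freq * num_gpus

# Per-frame preprocessing as described: random cropping and resizing to 224x224.
frame_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

optimizer = AdamW(model.parameters(), lr=base_lr,
                  betas=(0.9, 0.999), weight_decay=weight_decay)

def lr_lambda(epoch: int) -> float:
    # Linear warm-up for the first 5 epochs, then cosine annealing to zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)  # stepped once per epoch
```

In an actual training loop, gradients would be accumulated over `update_freq` mini-batches before each optimizer step, and `scheduler.step()` would be called once per epoch so the warm-up and cosine schedule follow the epoch counts above.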