Language-based Action Concept Spaces Improve Video Self-Supervised Learning

Authors: Kanchana Ranasinghe, Michael S. Ryoo

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments on action recognition datasets showcase state-of-the-art performance for our learned representations under linear-probing, standard zero-shot, and transductive zero-shot settings. |
| Researcher Affiliation | Academia | Kanchana Ranasinghe, Stony Brook University, kranasinghe@cs.stonybrook.edu; Michael Ryoo, Stony Brook University, mryoo@cs.stonybrook.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Our action descriptions will be released publicly along with our codebase. |
| Open Datasets | Yes | We use three standard action recognition benchmark datasets in our experiments: Kinetics-400 [70], UCF-101 [71], and HMDB-51 [72]. |
| Dataset Splits | Yes | Kinetics-400 is a large-scale dataset containing 240,000 training videos and 20,000 validation videos belonging to 400 different action classes. |
| Hardware Specification | Yes | We train for 15 epochs using a batch size of 32 across 4 NVIDIA A5000 GPUs using the AdamW [76, 77] optimizer on the student model with an initial learning rate of 1e-5 following a cosine decay schedule. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and building upon the SVT and CLIP source code and weights, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train for 15 epochs using a batch size of 32 across 4 NVIDIA A5000 GPUs using the AdamW [76, 77] optimizer on the student model with an initial learning rate of 1e-5 following a cosine decay schedule. The EMA teacher is updated from student weights after each training iteration with a decay ratio of 2e-4. |
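The quoted experiment setup (15 epochs, AdamW on the student, learning rate 1e-5 with cosine decay, per-iteration EMA teacher update with decay ratio 2e-4) is concrete enough to sketch as code. The snippet below is a minimal PyTorch sketch under stated assumptions, not the authors' released implementation: `student`, `teacher`, `loader`, and `compute_loss` are placeholders, the multi-GPU (4x A5000, batch size 32) setup is omitted, and the reading of "decay ratio 2e-4" as the per-step EMA mixing coefficient is an assumption.

```python
# Minimal sketch of the training setup described in the paper's experiment section.
# Assumptions: `student` and `teacher` are identically shaped nn.Modules, `loader`
# yields video clips, and `compute_loss` stands in for the paper's objective.
import copy
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def train(student, loader, compute_loss, epochs=15, lr=1e-5, ema_decay=2e-4,
          device="cuda"):
    student = student.to(device)
    teacher = copy.deepcopy(student).to(device)  # EMA teacher starts as a copy
    for p in teacher.parameters():
        p.requires_grad_(False)

    optimizer = AdamW(student.parameters(), lr=lr)  # AdamW on the student only
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * len(loader))  # cosine decay

    for _ in range(epochs):
        for clips in loader:
            clips = clips.to(device)
            loss = compute_loss(student, teacher, clips)  # placeholder objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            # EMA teacher update after every iteration; "decay ratio 2e-4" is read
            # here as teacher <- (1 - 2e-4) * teacher + 2e-4 * student (assumption).
            with torch.no_grad():
                for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                    t_p.mul_(1.0 - ema_decay).add_(s_p, alpha=ema_decay)
    return student, teacher
```

In this sketch gradients flow only through the student, and the teacher is refreshed purely by exponential moving average, matching the student/teacher description in the quoted setup; any distributed-training details from the 4-GPU configuration are left out.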