Reformulating Zero-shot Action Recognition for Multi-label Actions

Authors: Alec Kerrigan, Kevin Duarte, Yogesh Rawat, Mubarak Shah

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations show that our method not only achieves strong performance on three single-label action classification datasets (UCF-101, HMDB, and RareAct), but also outperforms previous ZSAR approaches on a challenging multi-label dataset (AVA) and a real-world surprise activity detection dataset (MEVA).
Researcher Affiliation | Academia | Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816, {aleckerrigan,kevin_duarte}@knights.ucf.edu, {yogesh,shah}@crcv.ucf.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the described methodology, nor does it state that the code is open-source or available.
Open Datasets | Yes | We train our models on the Kinetics 700 [35] action recognition dataset. We evaluate on the UCF-101 [36], HMDB [37], and RareAct [38] datasets. The Atomic Visual Actions (AVA) dataset [41] annotates 80 atomic visual actions in 340 15-minute video clips. We train and evaluate on the Multiview Extended Video with Activities (MEVA) dataset [42].
Dataset Splits | Yes | We evaluate on the validation set, which contains 64 videos split into 54k one-second clips. The data is split into 22 hours for training, and 122 hours are sequestered for the NIST Activities in Extended Video (ActEV) challenge.
Hardware Specification | Yes | All experiments are performed on two Nvidia Tesla V100 GPUs.
Software Dependencies | No | The paper mentions software like PyTorch and the Adam optimizer but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The video encoder is the PyTorch [32] implementation of the R(2+1)D-18 [31] network. This network outputs a visual embedding of dimension Dv = 512 for each 16-frame video clip. We average predictions over 25 clips per video at test time. Our Text Refining module consists of a learned 3-layer MLP with hidden dimensions of 1024 and ReLU activations, and a final output dimension of Dt = 600. The loss of our model is minimized using the Adam optimizer [34] with a starting learning rate of 1e-3 and a batch size of 114. The model is trained for 50 epochs with a learning rate decrease by a factor of 10 at epoch 30.
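The experiment-setup row above is essentially a training configuration, so a minimal PyTorch sketch of the described Text Refining MLP, optimizer, and learning-rate schedule is given below. The module name TextRefiner, the input embedding size IN_DIM, and the placeholder training loop are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

D_V = 512     # visual embedding size from the R(2+1)D-18 encoder (per the setup row)
D_T = 600     # output dimension of the Text Refining module (per the setup row)
IN_DIM = 600  # assumed input text-embedding size; not specified in this excerpt

class TextRefiner(nn.Module):
    """Illustrative 3-layer MLP: 1024-d hidden layers, ReLU activations, Dt = 600 output."""
    def __init__(self, in_dim=IN_DIM, hidden=1024, out_dim=D_T):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

refiner = TextRefiner()
# Adam with a starting learning rate of 1e-3, decayed by a factor of 10 at epoch 30.
optimizer = torch.optim.Adam(refiner.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

for epoch in range(50):
    # ... one pass over the training data with batch size 114 would go here ...
    scheduler.step()
```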