Reformulating Zero-shot Action Recognition for Multi-label Actions

Authors: Alec Kerrigan, Kevin Duarte, Yogesh Rawat, Mubarak Shah

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations show that our method not only achieves strong performance on three single-label action classification datasets (UCF-101, HMDB, and RareAct), but also outperforms previous ZSAR approaches on a challenging multi-label dataset (AVA) and a real-world surprise activity detection dataset (MEVA).
Researcher Affiliation | Academia | Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816, {aleckerrigan,kevin_duarte}@knights.ucf.edu, {yogesh,shah}@crcv.ucf.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the described methodology, nor does it state that the code is open-source or available.
Open Datasets | Yes | We train our models on the Kinetics 700 [35] action recognition dataset. We evaluate on the UCF-101 [36], HMDB [37], and RareAct [38] datasets. The Atomic Visual Actions (AVA) dataset [41] annotates 80 atomic visual actions in 340 15-minute video clips. We train and evaluate on the Multiview Extended Video with Activities (MEVA) dataset [42].
Dataset Splits | Yes | We evaluate on the validation set, which contains 64 videos split into 54k one-second clips. The data is split into 22 hours for training, and 122 hours are sequestered for the NIST Activities in Extended Video (ActEV) challenge.
Hardware Specification | Yes | All experiments are performed on two Nvidia Tesla V100 GPUs.
Software Dependencies | No | The paper mentions software like PyTorch and the Adam optimizer but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The video encoder is the PyTorch [32] implementation of the R(2+1)D-18 [31] network. This network outputs a visual embedding of dimension Dv = 512 for each 16-frame video clip. We average predictions over 25 clips per video at test time. Our Text Refining module consists of a learned 3-layer MLP with hidden dimensions of 1024 and ReLU activations, and a final output dimension of Dt = 600. The loss of our model is minimized using the Adam optimizer [34] with a starting learning rate of 1e-3 and a batch size of 114. The model is trained for 50 epochs with a learning rate decrease by a factor of 10 at epoch 30.
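The experiment-setup row above is essentially a training configuration, so a minimal PyTorch sketch of the described Text Refining MLP, optimizer, and learning-rate schedule is given below. The module name TextRefiner, the input embedding size IN_DIM, and the placeholder training loop are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

D_V = 512     # visual embedding size from the R(2+1)D-18 encoder (per the setup row)
D_T = 600     # output dimension of the Text Refining module (per the setup row)
IN_DIM = 600  # assumed input text-embedding size; not specified in this excerpt

class TextRefiner(nn.Module):
    """Illustrative 3-layer MLP: 1024-d hidden layers, ReLU activations, Dt = 600 output."""
    def __init__(self, in_dim=IN_DIM, hidden=1024, out_dim=D_T):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

refiner = TextRefiner()
# Adam with a starting learning rate of 1e-3, decayed by a factor of 10 at epoch 30.
optimizer = torch.optim.Adam(refiner.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

for epoch in range(50):
    # ... one pass over the training data with batch size 114 would go here ...
    scheduler.step()
```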