Reformulating Zero-shot Action Recognition for Multi-label Actions
Authors: Alec Kerrigan, Kevin Duarte, Yogesh Rawat, Mubarak Shah
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations show that our method not only achieves strong performance on three single-label action classification datasets (UCF-101, HMDB, and RareAct), but also outperforms previous ZSAR approaches on a challenging multi-label dataset (AVA) and a real-world surprise activity detection dataset (MEVA). |
| Researcher Affiliation | Academia | Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816, {aleckerrigan,kevin_duarte}@knights.ucf.edu, {yogesh,shah}@crcv.ucf.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the described methodology, nor does it state that the code is open-source or available. |
| Open Datasets | Yes | We train our models on the Kinetics 700 [35] action recognition dataset. We evaluate on the UCF-101 [36], HMDB [37], and RareAct [38] datasets. The Atomic Visual Actions (AVA) dataset [41] annotates 80 atomic visual actions in 340 15-minute video clips. We train and evaluate on the Multiview Extended Video with Activities (MEVA) dataset [42]. |
| Dataset Splits | Yes | We evaluate on the validation set which contains 64 videos split into 54k one-second clips. The data is split into 22 hours for training and 122 hours are sequestered for the NIST Activity in Extended Video (ActEV) challenge. |
| Hardware Specification | Yes | All experiments are performed on two Nvidia Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch and the Adam optimizer but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The video encoder is the PyTorch [32] implementation of the R(2+1)D-18 [31] network. This network outputs a visual embedding dimension of Dv = 512 for each 16-frame video clip. We average predictions over 25 clips per video at test time. Our Text Refining module consists of a learned 3-layer MLP with hidden dimensions of 1024 and ReLU activations, and a final output dimension of Dt = 600. The loss of our model is minimized using the Adam optimizer [34] with a starting learning rate of 1e-3 and a batch size of 114. The model is trained for 50 epochs with a learning rate decrease by a factor of 10 at epoch 30. A minimal sketch of this setup is given after the table. |
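
The quoted experiment setup is enough to sketch the main trainable components. The snippet below is a minimal PyTorch sketch under assumptions not stated in the excerpt: the torchvision `r2plus1d_18` model stands in for the paper's R(2+1)D-18 implementation, the input word-embedding dimension of 300 is a placeholder, and how the 512-d visual and 600-d refined-text embeddings are compared (e.g., via a learned projection) is not covered here.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18  # stand-in for the paper's R(2+1)D-18

D_V = 512           # visual embedding dimension (from the paper)
D_T = 600           # refined text embedding dimension (from the paper)
WORD_EMB_DIM = 300  # assumed input word-embedding size; not stated in the excerpt

# Video encoder: R(2+1)D-18 with its classification head removed,
# so each 16-frame clip yields a 512-d feature vector.
video_encoder = r2plus1d_18()
video_encoder.fc = nn.Identity()

# Text Refining module: a 3-layer MLP with 1024-d hidden layers, ReLU
# activations, and a 600-d output, as described in the setup.
text_refiner = nn.Sequential(
    nn.Linear(WORD_EMB_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, D_T),
)

# Optimization as quoted: Adam, initial learning rate 1e-3, batch size 114,
# 50 epochs, learning rate divided by 10 at epoch 30.
params = list(video_encoder.parameters()) + list(text_refiner.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)
BATCH_SIZE = 114
```

At test time the paper averages predictions over 25 clips per video; that step belongs in the evaluation loop rather than in the model definition sketched above.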