Semi-supervised Active Learning for Video Action Detection

Authors: Ayush Singh, Aayush J Rana, Akash Kumar, Shruti Vyas, Yogesh Singh Rawat

AAAI 2024

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed approach on three different benchmark datasets: UCF101-24, JHMDB-21, and YouTube-VOS. First, we demonstrate its effectiveness on video action detection, where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning, along with several baseline approaches, on both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on YouTube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos.
Researcher Affiliation | Academia | Ayush Singh (1), Aayush J Rana (2), Akash Kumar (2), Shruti Vyas (2), Yogesh Singh Rawat (2); (1) IIT (ISM) Dhanbad, (2) University of Central Florida. ayush.s.18je0204@cse.iitism.ac.in, {aayushjungbahadur.rana, akash.kumar, shruti, yogesh}@ucf.edu
Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm section. Figure 1 provides a high-level overview diagram, not pseudocode.
Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology or a link to a code repository.
Open Datasets | Yes | We conduct our experiments on three video datasets, UCF101-24 (Soomro, Zamir, and Shah 2012) and JHMDB-21 (Jhuang et al. 2013). UCF101-24 consists of 24 classes with a total of 3207 untrimmed videos with bounding box annotations. The JHMDB-21 dataset has 21 classes from a total of 928 videos with pixel-level annotations. Both UCF101-24 and JHMDB-21 are focused on the action detection task. We further generalize our approach on the YouTube-VOS dataset, a video object segmentation task, which has temporally sparse pixel-wise mask annotations for specific objects. It has 3471 videos for training with 65 object categories.
Dataset Splits | Yes | We begin with 5% labeled data and increment by 5% in every AL cycle, as shown in Figure 2(a-b). With more data, we notice that our AL method consistently performs better than the baseline selection methods. We also notice a cold-start problem for MC entropy: since the model does not perform well on most samples in the initial round of 10%, using only model entropy leads to non-optimal sample selection in future rounds.
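The 5%-per-cycle labeling budget described above can be sketched as follows. This is a minimal illustration of the schedule only; the function name and the number of cycles are placeholders, not from the paper.

```python
def al_budget_schedule(num_samples, start_frac=0.05, step_frac=0.05, cycles=4):
    """Number of labeled samples available at each active-learning cycle,
    starting at 5% of the pool and growing by 5% per cycle."""
    budgets = []
    frac = start_frac
    for _ in range(cycles):
        budgets.append(int(round(frac * num_samples)))
        frac += step_frac
    return budgets

# Example with UCF101-24's 3207 videos:
# al_budget_schedule(3207) -> [160, 321, 481, 641]
```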
Hardware Specification | Yes | We use PyTorch to build our models and train them on a single 16GB GPU.
Software Dependencies | No | The paper mentions using PyTorch but does not specify a version number or list other software dependencies with their versions.
Experiment Setup | Yes | We use a batch size of 8 for training with an equal ratio of labeled and unlabeled samples per batch. We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 1e-4. We use an EMA update at a rate of 0.996. We train UCF101-24 for 80 epochs and JHMDB-21 for 50 epochs. Hyperparameters: We use a temporal block of T = 3 frames for the temporal average function in Equation 2. The loss weights for the consistency loss are λ1 = 0.5 and λ2 = 0.5 in Equation 10, and λ3 ∈ [0.01, 0.1], increased over the warmup range.
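Two of the reported details lend themselves to a short sketch: the EMA update at rate 0.996 and the λ3 ramp from 0.01 to 0.1. The pure-Python functions below are hedged reconstructions of those two rules; the warmup length is a placeholder, since the paper only gives the [0.01, 0.1] range.

```python
EMA_DECAY = 0.996  # EMA update rate reported in the paper

def ema_update(teacher_w, student_w, decay=EMA_DECAY):
    """Per-weight EMA: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher_w, student_w)]

def lambda3(epoch, warmup_epochs=10, lo=0.01, hi=0.1):
    """Linearly increase lambda3 from lo to hi over the warmup range,
    then hold at hi. warmup_epochs is an assumed placeholder value."""
    frac = min(epoch / warmup_epochs, 1.0)
    return lo + (hi - lo) * frac
```

For instance, `lambda3(0)` returns 0.01 and the value saturates at 0.1 once the warmup range is exhausted, matching the "increased over warmup range" description.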