Semi-supervised Active Learning for Video Action Detection
Authors: Ayush Singh, Aayush J Rana, Akash Kumar, Shruti Vyas, Yogesh Singh Rawat
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach on three different benchmark datasets, UCF101-24, JHMDB-21, and YouTube-VOS. First, we demonstrate its effectiveness on video action detection where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches in both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on YouTube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos. |
| Researcher Affiliation | Academia | Ayush Singh1, Aayush J Rana2, Akash Kumar2, Shruti Vyas2, Yogesh Singh Rawat2 1 IIT (ISM) Dhanbad 2University of Central Florida ayush.s.18je0204@cse.iitism.ac.in, {aayushjungbahadur.rana, akash.kumar, shruti, yogesh}@ucf.edu |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm section. Figure 1 provides a high-level overview diagram, not pseudocode. |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We conduct our experiments on three video datasets, UCF101-24 (Soomro, Zamir, and Shah 2012) and JHMDB-21 (Jhuang et al. 2013). UCF101-24 consists of 24 classes with a total of 3207 untrimmed videos with bounding box annotations. JHMDB-21 dataset has 21 classes from a total of 928 videos with pixel-level annotations. Both UCF101-24 and JHMDB-21 are focused on the action detection task. We further generalize our approach on the YouTube-VOS dataset, a video object segmentation task, which has temporally sparse pixel-wise mask annotation for specific objects. It has 3471 videos for training with 65 object categories. |
| Dataset Splits | Yes | We begin with 5% labeled data and increment by 5% in every AL cycle as shown in Figure 2(a-b). With more data, we notice that compared to baseline selection methods, our AL method consistently performs better. We also notice a cold start problem for MC entropy: since the model does not perform well on most samples in the initial round of 10%, using only model entropy leads to non-optimal sample selection in future rounds. |
| Hardware Specification | Yes | We use PyTorch to build our models and train them on a single 16GB GPU. |
| Software Dependencies | No | The paper mentions using "PyTorch" but does not specify a version number or list other software dependencies with their versions. |
| Experiment Setup | Yes | We use a batch size of 8 for training with an equal ratio of labeled and unlabeled samples per batch. We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 1e-4. We use an EMA update at a rate of 0.996. We train UCF101-24 for 80 epochs and JHMDB-21 for 50 epochs. Hyperparameters: We use a temporal block of T = 3 frames for the temporal average function in Equation 2. The loss weights for the consistency loss are λ1 = 0.5 and λ2 = 0.5 in Equation 10, and λ3 ∈ [0.01, 0.1] increased over the warmup range. |
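The Dataset Splits row describes an active-learning loop that starts with 5% labeled data and adds 5% per cycle, selecting samples by an uncertainty score. A minimal sketch of that budget schedule and a generic top-k uncertainty selection step (function names `labeled_budget` and `select_for_labeling` are illustrative, not from the paper):

```python
def labeled_budget(cycle, start=0.05, increment=0.05):
    """Fraction of the training set labeled at a given AL cycle (cycle 0 = 5%)."""
    return min(start + cycle * increment, 1.0)

def select_for_labeling(uncertainty_scores, n):
    """Return indices of the n most uncertain unlabeled samples.

    A generic greedy top-k selection; the paper's actual scoring
    combines model entropy with other criteria.
    """
    order = sorted(range(len(uncertainty_scores)),
                   key=lambda i: uncertainty_scores[i],
                   reverse=True)
    return order[:n]
```

Under this schedule the labeled fraction after two cycles is 15%, matching the 5% increments shown in the paper's Figure 2(a-b).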
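Two details in the Experiment Setup row are easy to misread: the EMA teacher update at rate 0.996 and the warmup ramp of λ3 from 0.01 to 0.1. A minimal sketch of both, using plain Python lists in place of model parameters (names and the linear ramp shape are assumptions, not taken from the paper):

```python
def ema_update(teacher_params, student_params, decay=0.996):
    """Exponential-moving-average update: teacher <- decay*teacher + (1-decay)*student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

def lambda3_schedule(step, warmup_steps, lo=0.01, hi=0.1):
    """Ramp the loss weight lambda3 from lo to hi over the warmup range,
    then hold it at hi. A linear ramp is assumed here."""
    frac = min(step / max(warmup_steps, 1), 1.0)
    return lo + frac * (hi - lo)
```

For example, with a single teacher weight 1.0 and student weight 0.0, one EMA step yields 0.996, and λ3 sits at 0.055 halfway through warmup.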