Boundary Denoising for Video Activity Localization
Authors: Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Perez-Rua, Bernard Ghanem
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QVHighlights dataset. Moreover, DenoiseLoc achieves state-of-the-art performance on the MAD dataset with far fewer predictions than other methods. |
| Researcher Affiliation | Collaboration | 1King Abdullah University of Science and Technology (KAUST) 2National University of Singapore 3Meta AI |
| Pseudocode | No | The paper includes architectural diagrams and descriptions of its components but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our work is released to the public under the terms of the MIT license. |
| Open Datasets | Yes | MAD (Soldan et al., 2022). This recently released dataset comprises 384K natural language queries (train 280,183, validation 32,064, test 72,044) temporally grounded in 650 full-length movies, for a total of over 1.2K hours of video, making it the largest dataset collected for the video-language grounding task. QVHighlights (Lei et al., 2021a). This is the only trimmed-video dataset for the grounding task, consisting of 10,148 short videos, each 150s long. Notably, this dataset is characterized by multiple moments associated with each query, yielding a total of 18,367 annotated moments and 10,310 queries (train 7,218, validation 1,550, test 1,542). |
| Dataset Splits | Yes | MAD: train 280,183 / validation 32,064 / test 72,044 queries. QVHighlights: train 7,218 / validation 1,550 / test 1,542 queries. (The split sizes are also summarized in the first sketch after the table.) |
| Hardware Specification | No | The paper mentions the software environment for testing (“Python 3.8, PyTorch 1.13, and CUDA 11.6”) but does not specify any hardware components such as the CPU or GPU models used for experiments. |
| Software Dependencies | Yes | Our algorithm is compiled and tested using Python 3.8, PyTorch 1.13, and CUDA 11.6. (A version-check sketch follows the table.) |
| Experiment Setup | Yes | Our training setting follows Moment-DETR (Lei et al., 2021a). We also use a fixed number of 30 queries (proposals) during both training and inference. To train our model, we use the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 1e-4 and weight decay of 1e-4. We train the model for 200 epochs and select the checkpoint with the best validation-set performance for ablation. (See the configuration sketch after the table.) |
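
The split sizes quoted in the dataset rows can be collected into a small reference structure. This is a minimal sketch using plain Python dictionaries; the names are illustrative and not taken from the authors' code, while the numbers come directly from the paper.

```python
# Dataset split sizes as reported in the paper, gathered into a plain dict
# for quick reference. Dictionary names are illustrative only.
SPLITS = {
    "MAD": {"train": 280_183, "val": 32_064, "test": 72_044},       # queries
    "QVHighlights": {"train": 7_218, "val": 1_550, "test": 1_542},  # queries
}

# Sanity checks against the totals stated in the paper.
assert sum(SPLITS["MAD"].values()) == 384_291          # ~384K queries
assert sum(SPLITS["QVHighlights"].values()) == 10_310  # 10,310 queries
```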
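
Since only software versions are reported (no hardware), checking a local runtime against those versions is straightforward. The sketch below assumes a standard PyTorch installation and simply prints the detected versions for comparison.

```python
# Report the detected Python, PyTorch, and CUDA versions so they can be
# compared against the paper's stated environment (3.8 / 1.13 / 11.6).
import sys
import torch

print("Python :", sys.version.split()[0])   # paper reports 3.8
print("PyTorch:", torch.__version__)        # paper reports 1.13
print("CUDA   :", torch.version.cuda)       # paper reports 11.6
```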
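
The reported optimizer settings can be wired up directly in PyTorch. The sketch below is a hypothetical stand-in: the `nn.Linear` placeholder is not the authors' model, and only the hyperparameters (AdamW, learning rate 1e-4, weight decay 1e-4, 200 epochs, 30 queries) come from the paper.

```python
# Hypothetical training-setup sketch using the hyperparameters quoted above.
# The placeholder module stands in for the actual DenoiseLoc network.
import torch
from torch import nn
from torch.optim import AdamW

NUM_QUERIES = 30   # fixed number of queries (proposals) at train and test time
NUM_EPOCHS = 200   # checkpoint selected by best validation performance

model = nn.Linear(256, NUM_QUERIES * 2)  # placeholder, NOT the paper's model
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

for epoch in range(NUM_EPOCHS):
    optimizer.zero_grad()
    dummy_features = torch.randn(8, 256)   # stand-in video features
    spans = model(dummy_features)          # (batch, NUM_QUERIES * 2)
    loss = spans.pow(2).mean()             # stand-in loss term
    loss.backward()
    optimizer.step()
```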