Boundary Denoising for Video Activity Localization

Authors: Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Perez-Rua, Bernard Ghanem

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QVHighlights dataset. Moreover, DenoiseLoc achieves state-of-the-art performance on the MAD dataset but with much fewer predictions than others.
Researcher Affiliation | Collaboration | King Abdullah University of Science and Technology (KAUST); National University of Singapore; Meta AI
Pseudocode | No | The paper includes architectural diagrams and descriptions of its components but does not provide formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our work is released to the public in accordance with the limitation of the MIT license.
Open Datasets | Yes | MAD (Soldan et al., 2022). This recently released dataset comprises 384K natural language queries (train 280,183, validation 32,064, test 72,044) temporally grounded in 650 full-length movies for a total of over 1.2K hours of video, making it the largest dataset collected for the video language grounding task. QVHighlights (Lei et al., 2021a). This is the only trimmed video dataset for the grounding task, constituted by 10,148 short videos with a duration of 150s. Notably, this dataset is characterized by multiple moments associated with each query, yielding a total of 18,367 annotated moments and 10,310 queries (train 7,218, validation 1,550, test 1,542).
Dataset Splits | Yes | MAD (Soldan et al., 2022). This recently released dataset comprises 384K natural language queries (train 280,183, validation 32,064, test 72,044) temporally grounded in 650 full-length movies for a total of over 1.2K hours of video, making it the largest dataset collected for the video language grounding task. QVHighlights (Lei et al., 2021a). This is the only trimmed video dataset for the grounding task, constituted by 10,148 short videos with a duration of 150s. Notably, this dataset is characterized by multiple moments associated with each query, yielding a total of 18,367 annotated moments and 10,310 queries (train 7,218, validation 1,550, test 1,542).
Hardware Specification | No | The paper mentions the software environment for testing (“Python 3.8, PyTorch 1.13, and CUDA 11.6”) but does not specify hardware such as the CPU or GPU models used for the experiments.
Software Dependencies | Yes | Our algorithm is compiled and tested using Python 3.8, PyTorch 1.13, and CUDA 11.6.
Experiment Setup | Yes | Our training setting follows Moment-DETR (Lei et al., 2021a). We also use a fixed number of 30 queries (proposals) during both training and inference. To train our model, we use the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 1e-4 and weight decay of 1e-4. We train the model for 200 epochs and select the checkpoint with the best validation set performance for ablation.
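
To make the training recipe in the last row concrete, the snippet below is a minimal sketch of how those hyperparameters could be wired together in PyTorch. Only the values quoted above (30 queries, AdamW with learning rate 1e-4 and weight decay 1e-4, 200 epochs, best-validation checkpoint selection) come from the paper; the stand-in model, the `run_one_epoch`/`evaluate` callables, and the checkpoint path are hypothetical placeholders, not the authors' released code.

```python
import torch
from torch import nn
from torch.optim import AdamW

# Hyperparameters quoted in the experiment-setup row above.
NUM_QUERIES = 30        # fixed number of moment proposals at train and inference time
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 1e-4
NUM_EPOCHS = 200

# Stand-in head: 30 queries, each predicting a (start, end) boundary pair.
# The real DenoiseLoc architecture is not reproduced here.
model = nn.Linear(256, NUM_QUERIES * 2)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

def train(run_one_epoch, evaluate, ckpt_path="best_checkpoint.pth"):
    """Train for NUM_EPOCHS epochs and keep the checkpoint with the best validation score."""
    best_val = float("-inf")
    for epoch in range(NUM_EPOCHS):
        run_one_epoch(model, optimizer)   # user-supplied training pass over one epoch
        val_score = evaluate(model)       # e.g. average mAP on the validation split
        if val_score > best_val:
            best_val = val_score
            torch.save(model.state_dict(), ckpt_path)
    return best_val
```

The checkpoint-selection loop mirrors the paper's statement that the best validation checkpoint is used for ablation; the loss terms and data loading, which the paper inherits from Moment-DETR, are left to the caller.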