Boundary Denoising for Video Activity Localization
Authors: Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Perez-Rua, Bernard Ghanem
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QVHighlights dataset. Moreover, DenoiseLoc achieves state-of-the-art performance on the MAD dataset with far fewer predictions than other methods. |
| Researcher Affiliation | Collaboration | 1King Abdullah University of Science and Technology (KAUST) 2National University of Singapore 3Meta AI |
| Pseudocode | No | The paper includes architectural diagrams and descriptions of its components but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our work is released to the public under the terms of the MIT license. |
| Open Datasets | Yes | MAD (Soldan et al., 2022). This recently released dataset comprises 384K natural language queries (train 280,183, validation 32,064, test 72,044) temporally grounded in 650 full-length movies, for a total of over 1.2K hours of video, making it the largest dataset collected for the video-language grounding task. QVHighlights (Lei et al., 2021a). This is the only trimmed-video dataset for the grounding task, consisting of 10,148 short videos, each 150s long. Notably, this dataset is characterized by multiple moments associated with each query, yielding a total of 18,367 annotated moments and 10,310 queries (train 7,218, validation 1,550, test 1,542). |
| Dataset Splits | Yes | MAD: train 280,183 / validation 32,064 / test 72,044 queries. QVHighlights: train 7,218 / validation 1,550 / test 1,542 queries. (The split sizes are also summarized in the first sketch after the table.) |
| Hardware Specification | No | The paper mentions the software environment for testing (“Python 3.8, PyTorch 1.13, and CUDA 11.6”) but does not specify any hardware components such as the CPU or GPU models used for experiments. |
| Software Dependencies | Yes | Our algorithm is compiled and tested using Python 3.8, PyTorch 1.13, and CUDA 11.6. (A version-check sketch follows the table.) |
| Experiment Setup | Yes | Our training setting follows Moment-DETR (Lei et al., 2021a). We also use a fixed number of 30 queries (proposals) during both training and inference. To train our model, we use the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 1e-4 and weight decay of 1e-4. We train the model for 200 epochs and select the checkpoint with the best validation-set performance for ablation. (See the configuration sketch after the table.) |
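
The split sizes quoted in the dataset rows can be collected into a small reference structure. This is a minimal sketch using plain Python dictionaries; the names are illustrative and not taken from the authors' code, while the numbers come directly from the paper.

```python
# Dataset split sizes as reported in the paper, gathered into a plain dict
# for quick reference. Dictionary names are illustrative only.
SPLITS = {
    "MAD": {"train": 280_183, "val": 32_064, "test": 72_044},       # queries
    "QVHighlights": {"train": 7_218, "val": 1_550, "test": 1_542},  # queries
}

# Sanity checks against the totals stated in the paper.
assert sum(SPLITS["MAD"].values()) == 384_291          # ~384K queries
assert sum(SPLITS["QVHighlights"].values()) == 10_310  # 10,310 queries
```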
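
Since only software versions are reported (no hardware), checking a local runtime against those versions is straightforward. The sketch below assumes a standard PyTorch installation and simply prints the detected versions for comparison.

```python
# Report the detected Python, PyTorch, and CUDA versions so they can be
# compared against the paper's stated environment (3.8 / 1.13 / 11.6).
import sys
import torch

print("Python :", sys.version.split()[0])   # paper reports 3.8
print("PyTorch:", torch.__version__)        # paper reports 1.13
print("CUDA   :", torch.version.cuda)       # paper reports 11.6
```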
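
The reported optimizer settings can be wired up directly in PyTorch. The sketch below is a hypothetical stand-in: the `nn.Linear` placeholder is not the authors' model, and only the hyperparameters (AdamW, learning rate 1e-4, weight decay 1e-4, 200 epochs, 30 queries) come from the paper.

```python
# Hypothetical training-setup sketch using the hyperparameters quoted above.
# The placeholder module stands in for the actual DenoiseLoc network.
import torch
from torch import nn
from torch.optim import AdamW

NUM_QUERIES = 30   # fixed number of queries (proposals) at train and test time
NUM_EPOCHS = 200   # checkpoint selected by best validation performance

model = nn.Linear(256, NUM_QUERIES * 2)  # placeholder, NOT the paper's model
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

for epoch in range(NUM_EPOCHS):
    optimizer.zero_grad()
    dummy_features = torch.randn(8, 256)   # stand-in video features
    spans = model(dummy_features)          # (batch, NUM_QUERIES * 2)
    loss = spans.pow(2).mean()             # stand-in loss term
    loss.backward()
    optimizer.step()
```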