Detecting Moments and Highlights in Videos via Natural Language Queries

Authors: Jie Lei, Tamara L. Berg, Mohit Bansal

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr.
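To make the set-prediction framing concrete, below is a minimal PyTorch sketch of a Moment-DETR-style encoder-decoder using the sizes reported later in this table (hidden size d=256, T=2 layers, N=10 moment queries). The class name, the 8-head attention, the 2816-dim input feature size, and the exact head layouts are illustrative assumptions rather than the authors' implementation; the linked repository contains the real model.

```python
# A minimal sketch of a Moment-DETR-style encoder-decoder, assuming
# pre-extracted video + query features are concatenated into one token
# sequence. All names and the feat_dim/nhead choices are illustrative.
import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    def __init__(self, feat_dim=2816, d=256, num_layers=2, num_queries=10):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d)  # project features to hidden size d
        self.transformer = nn.Transformer(
            d_model=d, nhead=8,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.query_embed = nn.Embedding(num_queries, d)  # learnable moment queries
        self.span_head = nn.Linear(d, 2)      # (center, width), normalized to [0, 1]
        self.class_head = nn.Linear(d, 2)     # foreground / background logits
        self.saliency_head = nn.Linear(d, 1)  # per-token saliency score

    def forward(self, feats):
        # feats: (batch, seq_len, feat_dim), the concatenated video/query tokens
        memory_in = self.input_proj(feats)
        queries = self.query_embed.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        # encoder contextualizes the sequence; decoder turns the N moment
        # queries into N candidate moment slots (direct set prediction)
        memory = self.transformer.encoder(memory_in)
        hs = self.transformer.decoder(queries, memory)
        return {
            "spans": self.span_head(hs).sigmoid(),               # (batch, N, 2)
            "logits": self.class_head(hs),                       # (batch, N, 2)
            "saliency": self.saliency_head(memory).squeeze(-1),  # (batch, seq_len)
        }

model = MomentDETRSketch()
out = model(torch.randn(4, 75, 2816))
print(out["spans"].shape, out["saliency"].shape)
```

As in DETR, each decoder query yields one candidate moment, so no anchor windows or sliding-window proposals (human priors) are needed.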
Researcher Affiliation | Academia | Jie Lei, Tamara L. Berg, Mohit Bansal, Department of Computer Science, University of North Carolina at Chapel Hill, {jielei, tlberg, mbansal}@cs.unc.edu
Pseudocode | No | The paper provides architectural diagrams and mathematical loss formulations but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code are publicly available at https://github.com/jayleicn/moment_detr.
Open Datasets | Yes | Data and code are publicly available at https://github.com/jayleicn/moment_detr.
Dataset Splits | Yes | We split QVHIGHLIGHTS into 70% train, 15% val, and 15% test portions.
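For illustration, here is a minimal sketch of producing such a 70/15/15 split over annotation records. The released QVHighlights splits are fixed files shipped with the dataset, so this helper (split_dataset, the seed, and the record list) is purely hypothetical.

```python
# A minimal sketch of a seeded 70/15/15 random split; illustrative only,
# not the authors' released split files.
import random

def split_dataset(records, seed=2021, train=0.70, val=0.15):
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # remaining ~15% becomes test

train_set, val_set, test_set = split_dataset(list(range(10_000)))
print(len(train_set), len(val_set), len(test_set))  # 7000 1500 1500
```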
Hardware Specification | Yes | Both training/finetuning and pretraining are conducted on an RTX 2080Ti GPU, with training/finetuning taking 12 hours and pretraining 2 days.
Software Dependencies | No | Our model is implemented in PyTorch [29].
Experiment Setup | Yes | We set the hidden size d=256, #layers in encoder/decoder T=2, #moment queries N=10. We use dropout of 0.1 for transformer layers and 0.5 for input projection layers. We set the loss hyperparameters as λ_L1=10, λ_iou=1, λ_cls=4, λ_s=1, and the saliency margin Δ=0.2. The model weights are initialized with Xavier init [10]. We use AdamW [23] with an initial learning rate of 1e-4 and weight decay of 1e-4 to optimize the model parameters. The model is trained for 200 epochs with batch size 32. For pretraining, we use the same setup except that we train the model for 100 epochs with batch size 256.
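The reported setup maps directly onto standard PyTorch calls. Below is a minimal sketch combining Xavier initialization, AdamW with the stated learning rate and weight decay, and the λ-weighted loss; the individual loss inputs are stand-in tensors, and the hinge-style saliency term assumes Δ is the saliency margin, so none of this is the paper's actual loss code.

```python
# A minimal sketch of the reported optimization setup; the loss terms are
# stand-ins to show how the lambda weights combine, not the paper's code.
import torch
import torch.nn as nn

model = nn.Linear(256, 2)  # placeholder module standing in for Moment-DETR

# Xavier init for weight matrices, per the reported setup
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# AdamW with lr=1e-4 and weight decay 1e-4, as reported
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# loss weights as reported; delta is assumed to be the saliency margin
lambda_l1, lambda_iou, lambda_cls, lambda_s, delta = 10.0, 1.0, 4.0, 1.0, 0.2

def total_loss(l1, iou, cls, sal_high, sal_low):
    # hinge-style saliency term with margin delta, assuming sal_high/sal_low
    # are saliency scores of a higher- and lower-ranked clip pair
    saliency = torch.clamp(delta + sal_low - sal_high, min=0).mean()
    return lambda_l1 * l1 + lambda_iou * iou + lambda_cls * cls + lambda_s * saliency

# stand-in scalars; in training this runs for 200 epochs at batch size 32
loss = total_loss(torch.tensor(0.05), torch.tensor(0.3), torch.tensor(0.7),
                  torch.tensor(0.9), torch.tensor(0.4))
print(loss)
```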