Detecting Moments and Highlights in Videos via Natural Language Queries
Authors: Jie Lei, Tamara L. Berg, Mohit Bansal
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr. (A minimal architectural sketch follows the table.) |
| Researcher Affiliation | Academia | Jie Lei Tamara L. Berg Mohit Bansal Department of Computer Science University of North Carolina at Chapel Hill {jielei, tlberg, mbansal}@cs.unc.edu |
| Pseudocode | No | The paper provides architectural diagrams and mathematical loss formulations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and code are publicly available at https://github.com/jayleicn/moment_detr. |
| Open Datasets | Yes | Data and code are publicly available at https://github.com/jayleicn/moment_detr. |
| Dataset Splits | Yes | We split QVHIGHLIGHTS into 70% train, 15% val, and 15% test portions. |
| Hardware Specification | Yes | Both training/finetuning and pretraining are conducted on an RTX 2080Ti GPU, with training/finetuning taking 12 hours and pretraining 2 days. |
| Software Dependencies | No | The paper states only "Our model is implemented in PyTorch [29]." without version numbers or a fuller dependency list. |
| Experiment Setup | Yes | We set the hidden size d=256, #layers in encoder/decoder T=2, #moment queries N=10. We use dropout of 0.1 for transformer layers and 0.5 for input projection layers. We set the loss hyperparameters as λ_L1=10, λ_iou=1, λ_cls=4, λ_s=1, and the saliency margin Δ=0.2. The model weights are initialized with Xavier init [10]. We use AdamW [23] with an initial learning rate of 1e-4, weight decay 1e-4 to optimize the model parameters. The model is trained for 200 epochs with batch size 32. For pretraining, we use the same setup except that we train the model for 100 epochs with batch size 256. (A training-setup sketch follows the table.) |
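
To make the set-prediction formulation concrete, below is a minimal PyTorch sketch of a Moment-DETR-style encoder-decoder. It uses the hyperparameters reported in the paper (d=256, T=2 encoder/decoder layers, N=10 moment queries, dropout 0.1 for transformer layers and 0.5 for input projections), but the feature dimensions, head names, and the use of `nn.Transformer` are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    """Minimal sketch of a Moment-DETR-style model: encoder over
    concatenated video + query tokens, decoder over N learnable moment
    queries, predicting moment spans, fg/bg classes, and saliency."""

    def __init__(self, video_dim=2816, query_dim=512, d=256,
                 num_layers=2, num_queries=10):
        super().__init__()
        # Input projections with dropout 0.5, as reported in the paper.
        # video_dim / query_dim are illustrative assumptions.
        self.video_proj = nn.Sequential(nn.Dropout(0.5), nn.Linear(video_dim, d))
        self.text_proj = nn.Sequential(nn.Dropout(0.5), nn.Linear(query_dim, d))
        self.transformer = nn.Transformer(
            d_model=d, nhead=8,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=1024, dropout=0.1, batch_first=True,
        )
        # N learnable moment queries, decoded in parallel (set prediction).
        self.moment_queries = nn.Embedding(num_queries, d)
        self.span_head = nn.Linear(d, 2)      # normalized (center, width) per query
        self.class_head = nn.Linear(d, 2)     # foreground / background logits
        self.saliency_head = nn.Linear(d, 1)  # per-clip saliency from encoder tokens

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Lv, video_dim); text_feats: (B, Lt, query_dim)
        v = self.video_proj(video_feats)
        t = self.text_proj(text_feats)
        src = torch.cat([v, t], dim=1)  # encoder input: video + query tokens
        tgt = self.moment_queries.weight.unsqueeze(0).expand(v.size(0), -1, -1)
        # Run encoder and decoder explicitly so the encoder memory is
        # available for the saliency head.
        memory = self.transformer.encoder(src)
        hs = self.transformer.decoder(tgt, memory)
        spans = self.span_head(hs).sigmoid()  # (B, N, 2) moment coordinates
        logits = self.class_head(hs)          # (B, N, 2) moment classification
        saliency = self.saliency_head(memory[:, : v.size(1)]).squeeze(-1)  # (B, Lv)
        return spans, logits, saliency
```
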
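The optimization settings quoted in the Experiment Setup row translate directly into a training configuration. The sketch below shows how the reported values (Xavier init, AdamW with lr 1e-4 and weight decay 1e-4, and the loss weights λ_L1=10, λ_iou=1, λ_cls=4, λ_s=1) would combine; `MomentDETRSketch` is the illustrative model above, and the `total_loss` helper is an assumed name, not the authors' code.

```python
import torch

model = MomentDETRSketch()

# Xavier initialization for weight matrices, as reported in the paper.
for p in model.parameters():
    if p.dim() > 1:
        torch.nn.init.xavier_uniform_(p)

# AdamW with the reported learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Loss weights from the paper; the individual terms (span L1, generalized
# IoU, classification, saliency hinge) are computed elsewhere.
LAMBDA_L1, LAMBDA_IOU, LAMBDA_CLS, LAMBDA_S = 10.0, 1.0, 4.0, 1.0

def total_loss(l1_loss, iou_loss, cls_loss, saliency_loss):
    # Weighted sum of the four loss terms with the reported coefficients.
    return (LAMBDA_L1 * l1_loss + LAMBDA_IOU * iou_loss
            + LAMBDA_CLS * cls_loss + LAMBDA_S * saliency_loss)
```

Training then runs for 200 epochs at batch size 32 (100 epochs at batch size 256 for pretraining), per the quoted setup.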