Transferable Video Moment Localization by Moment-Guided Query Prompting

Authors: Hao Jiang, Yizhang Yang, Yadong Mu

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We carry out extensive experiments on Charades-STA, TACoS, DiDeMo, and YouCookII datasets, and investigate the efficacy of the proposed method using various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of our proposed method.
Researcher Affiliation | Academia | Hao Jiang, Yizhang Yang, Yadong Mu*; Wangxuan Institute of Computer Technology, Peking University; jianghao@stu.pku.edu.cn, myd@pku.edu.cn
Pseudocode | No | The paper describes the proposed method using figures and descriptive text but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | We release the code of this work on this website: https://code-website.wixsite.com/prompt-code
Open Datasets | Yes | TACoS contains 127 videos of kitchen scenes (Regneri et al. 2013) and 18,818 video-language pairs. ... Charades-STA contains 9,848 videos of daily indoor activities (Sigurdsson et al. 2016). ... DiDeMo contains 10,464 Flickr videos and 40,543 annotated queries (Anne Hendricks et al. 2017). ... YouCookII is an instructional video dataset collected by (Zhou, Xu, and Corso 2018)...
Dataset Splits | Yes | We follow the segmentation by (Gao et al. 2017), with 10,146, 4,589, and 4,083 video-query pairs in the training, validation, and test sets. ... We use the former dataset for training and the latter dataset for validation and testing.
Hardware Specification | No | The paper mentions using pre-trained vision-language models but does not provide any specific details about the hardware used for training or inference, such as GPU models, CPU specifications, or cloud computing resources.
Software Dependencies | No | The paper mentions pre-trained vision-language models (CLIP, ActionCLIP, VideoCLIP, CLIP4Clip) and the AdamW optimizer, but it does not specify any software libraries or frameworks with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | The number of global prompts z is set to 4. The number of sampled video clips used in the model is 32, and the size of the 2D moment map is 16... The hidden dimension of the model is 512, and the number of layers in the transformer block is 6. The number of heads in the multi-head attention layer is 8. Other parameter settings (e.g., non-maximum suppression threshold, scaling thresholds) are consistent with baseline methods (Wang et al. 2022b; Zhang et al. 2020a). AdamW optimizer (Loshchilov and Hutter 2018) is adopted in the experiment.
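To make the reported setup concrete, the sketch below assembles the stated hyperparameters (4 global prompts, 32 sampled clips, a 16x16 2D moment map, hidden dimension 512, 6 transformer layers, 8 attention heads, AdamW) into a minimal PyTorch configuration and module. This is a sketch under assumptions, not the authors' released implementation: the learning rate, the 2D-TAN-style average-pooled moment map, and all module/variable names are illustrative choices not specified in the paper.

```python
# Minimal sketch of the reported hyperparameters, assuming standard PyTorch.
# Module names, the moment-map construction, and the learning rate are assumptions.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class SetupConfig:
    num_global_prompts: int = 4   # "number of global prompts z"
    num_clips: int = 32           # sampled video clips per video
    moment_map_size: int = 16     # side length of the 2D moment map
    hidden_dim: int = 512
    num_layers: int = 6           # transformer blocks
    num_heads: int = 8
    lr: float = 1e-4              # assumed; the paper only names the AdamW optimizer


class MomentBackboneSketch(nn.Module):
    """Illustrative backbone: learnable global prompts, a transformer encoder,
    and a 2D-TAN-style moment map pooled from clip features (an assumption)."""

    def __init__(self, cfg: SetupConfig):
        super().__init__()
        self.cfg = cfg
        self.global_prompts = nn.Parameter(
            torch.randn(cfg.num_global_prompts, cfg.hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=cfg.hidden_dim, nhead=cfg.num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=cfg.num_layers)
        # Downsample 32 clip tokens to 16 candidate boundary positions.
        self.pool = nn.AvgPool1d(kernel_size=cfg.num_clips // cfg.moment_map_size)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, hidden_dim), e.g. projected CLIP features.
        b = clip_feats.size(0)
        prompts = self.global_prompts.unsqueeze(0).expand(b, -1, -1)
        x = self.encoder(torch.cat([prompts, clip_feats], dim=1))
        clips = x[:, self.cfg.num_global_prompts:]                # drop prompt tokens
        clips = self.pool(clips.transpose(1, 2)).transpose(1, 2)  # (b, 16, d)
        n = self.cfg.moment_map_size
        # Moment cell (i, j) = mean of clip features from i to j (upper triangle).
        cum = clips.cumsum(dim=1)
        moment_map = clips.new_zeros(b, n, n, self.cfg.hidden_dim)
        for i in range(n):
            for j in range(i, n):
                seg_sum = cum[:, j] - (cum[:, i - 1] if i > 0 else 0)
                moment_map[:, i, j] = seg_sum / (j - i + 1)
        return moment_map  # (b, 16, 16, hidden_dim)


if __name__ == "__main__":
    cfg = SetupConfig()
    model = MomentBackboneSketch(cfg)
    feats = torch.randn(2, cfg.num_clips, cfg.hidden_dim)
    print(model(feats).shape)  # torch.Size([2, 16, 16, 512])
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr)
```

Remaining settings (non-maximum suppression and scaling thresholds) are inherited from the cited baselines (Wang et al. 2022b; Zhang et al. 2020a) and are therefore omitted from this sketch.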