Transferable Video Moment Localization by Moment-Guided Query Prompting
Authors: Hao Jiang, Yizhang Yang, Yadong Mu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out extensive experiments on Charades-STA, TACoS, DiDeMo, and YouCook II datasets, and investigate the efficacy of the proposed method using various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of our proposed method. |
| Researcher Affiliation | Academia | Hao Jiang, Yizhang Yang, Yadong Mu* Wangxuan Institute of Computer Technology, Peking University jianghao@stu.pku.edu.cn, myd@pku.edu.cn |
| Pseudocode | No | The paper describes the proposed method using figures and descriptive text but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the code of this work on this website: https://code-website.wixsite.com/prompt-code |
| Open Datasets | Yes | TACoS contains 127 videos of kitchen scenes (Regneri et al. 2013) and 18,818 video-language pairs. ... Charades-STA contains 9,848 videos of daily indoor activities (Sigurdsson et al. 2016). ... DiDeMo contains 10,464 Flickr videos and 40,543 annotated queries (Anne Hendricks et al. 2017). ... YouCook II is an instructional video dataset collected by (Zhou, Xu, and Corso 2018)... |
| Dataset Splits | Yes | We follow the segmentation by (Gao et al. 2017), with 10,146, 4,589, 4,083 video-query pairs in training, validation, and test sets. ... We use the former dataset for training and the latter dataset for validation and testing. |
| Hardware Specification | No | The paper mentions using pre-trained vision-language models but does not provide any specific details about the hardware used for training or inference, such as GPU models, CPU specifications, or cloud computing resources. |
| Software Dependencies | No | The paper mentions pre-trained vision-language models (CLIP, ActionCLIP, VideoCLIP, CLIP4Clip) and the AdamW optimizer, but it does not specify any software libraries or frameworks with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | The number of global prompts z is set to 4. The number of sampled video clips used in the model is 32, and the size of the 2D moment map is 16... The hidden dimension of the model is 512, and the number of layers in the transformer block is 6. The number of heads in the multi-head attention layer is 8. Other parameter settings (e.g., non-maximum suppression threshold, scaling thresholds) are consistent with baseline methods (Wang et al. 2022b; Zhang et al. 2020a). AdamW optimizer (Loshchilov and Hutter 2018) is adopted in the experiment. |
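
The Experiment Setup row pins down enough hyperparameters to sketch a configuration. The PyTorch snippet below is a minimal, hypothetical reconstruction using only the reported values (4 global prompts, 32 sampled clips, a 16×16 moment map, hidden dimension 512, 6 transformer layers, 8 attention heads, AdamW). The module name `PromptedMomentEncoder`, the way prompts are prepended to clip features, and the learning rate are illustrative assumptions, not the authors' released implementation (which is linked in the Open Source Code row).

```python
import torch
from torch import nn

# Hyperparameters reported in the "Experiment Setup" row. The learning rate and
# the model internals are not specified in the excerpt; the module below is a
# placeholder stand-in, not the paper's architecture.
NUM_GLOBAL_PROMPTS = 4   # number of global prompts z
NUM_VIDEO_CLIPS = 32     # sampled clips per video
MOMENT_MAP_SIZE = 16     # reported size of the 2D moment map (unused in this sketch)
HIDDEN_DIM = 512
NUM_LAYERS = 6
NUM_HEADS = 8


class PromptedMomentEncoder(nn.Module):
    """Minimal placeholder: learnable global prompt tokens are prepended to the
    clip features and processed by a transformer encoder with the reported sizes."""

    def __init__(self):
        super().__init__()
        self.global_prompts = nn.Parameter(torch.randn(NUM_GLOBAL_PROMPTS, HIDDEN_DIM))
        layer = nn.TransformerEncoderLayer(
            d_model=HIDDEN_DIM, nhead=NUM_HEADS, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, NUM_VIDEO_CLIPS, HIDDEN_DIM), e.g. CLIP visual embeddings
        prompts = self.global_prompts.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, clip_feats], dim=1))


model = PromptedMomentEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is an assumption

# Smoke test with random features standing in for pre-extracted clip embeddings.
dummy = torch.randn(2, NUM_VIDEO_CLIPS, HIDDEN_DIM)
print(model(dummy).shape)  # torch.Size([2, 36, 512]): 4 prompt tokens + 32 clip tokens
```

Running the smoke test only confirms that the reported dimensions are internally consistent (4 prompt tokens plus 32 clip tokens, each 512-dimensional); reproducing the paper's results would additionally require the unreleased training details noted in the Hardware Specification and Software Dependencies rows.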