DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Authors: Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments reveal that when only 0.67% parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL (see the parameter-count sketch after the table). |
| Researcher Affiliation | Academia | Xiangpeng Yang¹, Linchao Zhu², Xiaohan Wang², Yi Yang²*; ¹ReLER, AAII, University of Technology Sydney; ²CCAI, Zhejiang University; Xiangpeng.Yang@student.uts.edu.au, wxh1996111@gmail.com, {zhulinchao,yangyics}@zju.edu.cn |
| Pseudocode | No | The paper contains figures illustrating frameworks and attention mechanisms, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code will be available at https://github.com/knightyxp/DGL |
| Open Datasets | Yes | We conduct experiments on four datasets including MSR-VTT (Xu et al. 2016), LSMDC (Rohrbach et al. 2015), ActivityNet (Heilbron et al. 2015), and VATEX (Wang et al. 2019). |
| Dataset Splits | No | The paper lists the datasets used (MSR-VTT, VATEX, LSMDC, ActivityNet) and mentions using CLIP (ViT-B/32) as the pre-trained model, along with training parameters such as learning rate, epochs, frame sampling, and prompt lengths. However, it does not explicitly state the train/validation/test split percentages or sample counts for these datasets. |
| Hardware Specification | No | The paper mentions GPU memory usage ('using above 30GB against CLIP4Clip's 20.8GB') and states that 'All experiments are done with mixed precision.' However, it does not specify particular hardware components such as GPU models, CPU models, or cloud computing instances used for the experiments. |
| Software Dependencies | No | The paper mentions using CLIP (ViT-B/32) as the pre-trained model and the AdamW optimizer. However, it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Implementation Details. We use CLIP (ViT-B/32) as the pre-trained model. During training, all the original parameters of CLIP are frozen unless explicitly mentioned. We apply a warm-up strategy followed by a cosine learning rate policy, using the AdamW optimizer with decoupled weight decay set to 0.2. The initial learning rate is 1e-2 for LSMDC and 5e-3 for the other three datasets. The maximum number of epochs is 10 for all datasets. Following CLIP4Clip, we uniformly sample 12 frames for MSR-VTT, LSMDC, and VATEX and set the caption token length to 32. For ActivityNet, the frame length and caption length are set to 64. All videos' short sides are resized to 224, and the frame rate (fps) is set to 3. By default, the lengths of the frame prompts, text prefix/postfix prompts, and global prompts are all set to 4. The depth of the frame prompts and text prefix/postfix prompts is also set to 12 by default. The inner dimension of the adapter is set to 368. All experiments are done with mixed precision. |
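
The Experiment Setup row above amounts to a compact training configuration. The sketch below is a minimal PyTorch-style illustration, not the authors' code: the hyperparameter values are taken from the paper as quoted in the table, while the function name `build_optimizer_and_scheduler`, the `warmup_steps` value, and the assumption that only prompt/adapter parameters have `requires_grad=True` are illustrative assumptions.

```python
# Hedged sketch of the reported training configuration.
# Hyperparameter values come from the paper's implementation details;
# module organization and warm-up length are placeholders.

import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Values quoted from the paper's implementation details.
CONFIG = {
    "backbone": "ViT-B/32",          # frozen CLIP backbone
    "lr": {"LSMDC": 1e-2, "MSR-VTT": 5e-3, "VATEX": 5e-3, "ActivityNet": 5e-3},
    "weight_decay": 0.2,             # AdamW decoupled weight decay
    "max_epochs": 10,
    "num_frames": {"MSR-VTT": 12, "LSMDC": 12, "VATEX": 12, "ActivityNet": 64},
    "caption_len": {"MSR-VTT": 32, "LSMDC": 32, "VATEX": 32, "ActivityNet": 64},
    "prompt_len": 4,                 # frame / text prefix-postfix / global prompts
    "prompt_depth": 12,
    "adapter_inner_dim": 368,
    "fps": 3,
    "resize_short_side": 224,
}

def build_optimizer_and_scheduler(model, dataset, steps_per_epoch, warmup_steps=100):
    """Optimize only the unfrozen (prompt/adapter) parameters; CLIP weights stay frozen.
    warmup_steps is an assumed value, not reported in the paper."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable,
                      lr=CONFIG["lr"][dataset],
                      weight_decay=CONFIG["weight_decay"])

    total_steps = CONFIG["max_epochs"] * steps_per_epoch

    def lr_lambda(step):
        # Linear warm-up followed by cosine decay, as described in the paper.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```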
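
The Research Type row quotes the paper's claim that only 0.67% of the parameters are tuned. The snippet below is a generic sketch (not taken from the authors' repository) of how such a fraction is typically verified: freeze the backbone, then compare trainable versus total parameter counts.

```python
# Sketch of the bookkeeping behind a "0.67% parameters tuned" figure.
# The 0.67% value itself comes from the paper; this only shows how the ratio
# of trainable to total parameters can be computed for any frozen-backbone model.

import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

# Usage (with any prompt-tuned model whose CLIP weights are frozen):
#   frac = trainable_fraction(model)
#   print(f"{100 * frac:.2f}% of parameters are trainable")
```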