DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Authors: Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments reveal that when only 0.67% parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL (see the parameter-count sketch after the table). |
| Researcher Affiliation | Academia | Xiangpeng Yang¹, Linchao Zhu², Xiaohan Wang², Yi Yang²*; ¹ReLER, AAII, University of Technology Sydney; ²CCAI, Zhejiang University; Xiangpeng.Yang@student.uts.edu.au, wxh1996111@gmail.com, {zhulinchao,yangyics}@zju.edu.cn |
| Pseudocode | No | The paper contains figures illustrating frameworks and attention mechanisms, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code will be available at https://github.com/knightyxp/DGL |
| Open Datasets | Yes | We conduct experiments on four datasets including MSR-VTT (Xu et al. 2016), LSMDC (Rohrbach et al. 2015), ActivityNet (Heilbron et al. 2015), and VATEX (Wang et al. 2019). |
| Dataset Splits | No | The paper lists the datasets used (MSR-VTT, VATEX, LSMDC, ActivityNet) and mentions using CLIP (ViT-B/32) as the pre-trained model, along with training parameters such as learning rate, epochs, frame sampling, and prompt lengths. However, it does not explicitly state the train/validation/test split percentages or sample counts for these datasets. |
| Hardware Specification | No | The paper mentions GPU memory usage ('using above 30GB against CLIP4Clip's 20.8GB') and states that 'All experiments are done with mixed precision.' However, it does not specify particular hardware components such as GPU models, CPU models, or cloud computing instances used for the experiments. |
| Software Dependencies | No | The paper mentions using CLIP (ViT-B/32) as the pre-trained model and the AdamW optimizer. However, it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Implementation Details. We use CLIP (ViT-B/32) as the pre-trained model. During training, all the original parameters of CLIP are frozen unless explicitly mentioned. We apply a warm-up strategy followed by a cosine learning rate policy, using the AdamW optimizer with decoupled weight decay set to 0.2. The initial learning rate is 1e-2 for LSMDC and 5e-3 for the other three datasets. The maximum number of epochs is 10 for all datasets. Following CLIP4Clip, we uniformly sample 12 frames for MSR-VTT, LSMDC, and VATEX and set the caption token length to 32. For ActivityNet, the frame length and caption length are set to 64. All videos' short sides are resized to 224, and the frame rate (fps) is set to 3. By default, the lengths of the frame prompts, text prefix/postfix prompts, and global prompts are all set to 4. The depth of the frame prompts and text prefix/postfix prompts is also set to 12 by default. The inner dimension of the adapter is set to 368. All experiments are done with mixed precision. |
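
The Experiment Setup row above amounts to a compact training configuration. The sketch below is a minimal PyTorch-style illustration, not the authors' code: the hyperparameter values are taken from the paper as quoted in the table, while the function name `build_optimizer_and_scheduler`, the `warmup_steps` value, and the assumption that only prompt/adapter parameters have `requires_grad=True` are illustrative assumptions.

```python
# Hedged sketch of the reported training configuration.
# Hyperparameter values come from the paper's implementation details;
# module organization and warm-up length are placeholders.

import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Values quoted from the paper's implementation details.
CONFIG = {
    "backbone": "ViT-B/32",          # frozen CLIP backbone
    "lr": {"LSMDC": 1e-2, "MSR-VTT": 5e-3, "VATEX": 5e-3, "ActivityNet": 5e-3},
    "weight_decay": 0.2,             # AdamW decoupled weight decay
    "max_epochs": 10,
    "num_frames": {"MSR-VTT": 12, "LSMDC": 12, "VATEX": 12, "ActivityNet": 64},
    "caption_len": {"MSR-VTT": 32, "LSMDC": 32, "VATEX": 32, "ActivityNet": 64},
    "prompt_len": 4,                 # frame / text prefix-postfix / global prompts
    "prompt_depth": 12,
    "adapter_inner_dim": 368,
    "fps": 3,
    "resize_short_side": 224,
}

def build_optimizer_and_scheduler(model, dataset, steps_per_epoch, warmup_steps=100):
    """Optimize only the unfrozen (prompt/adapter) parameters; CLIP weights stay frozen.
    warmup_steps is an assumed value, not reported in the paper."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable,
                      lr=CONFIG["lr"][dataset],
                      weight_decay=CONFIG["weight_decay"])

    total_steps = CONFIG["max_epochs"] * steps_per_epoch

    def lr_lambda(step):
        # Linear warm-up followed by cosine decay, as described in the paper.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```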
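
The Research Type row quotes the paper's claim that only 0.67% of the parameters are tuned. The snippet below is a generic sketch (not taken from the authors' repository) of how such a fraction is typically verified: freeze the backbone, then compare trainable versus total parameter counts.

```python
# Sketch of the bookkeeping behind a "0.67% parameters tuned" figure.
# The 0.67% value itself comes from the paper; this only shows how the ratio
# of trainable to total parameters can be computed for any frozen-backbone model.

import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

# Usage (with any prompt-tuned model whose CLIP weights are frozen):
#   frac = trainable_fraction(model)
#   print(f"{100 * frac:.2f}% of parameters are trainable")
```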