Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language
Authors: Yuqi Liu, Luhui Xu, Pengfei Xiong, Qin Jin
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out extensive experiments to compare our proposed token mixing method with other parameter-efficient transfer learning methods. Our token mixing method outperforms other methods on both understanding tasks and generation tasks. Besides, our method achieves new records on multiple video-language tasks. |
| Researcher Affiliation | Collaboration | Yuqi Liu¹,²*, Luhui Xu², Pengfei Xiong², Qin Jin¹ (¹ School of Information, Renmin University of China; ² Tencent) |
| Pseudocode | No | The paper provides mathematical equations and describes procedures in text, but it does not include any clearly labeled pseudocode blocks or algorithm sections. |
| Open Source Code | Yes | The code is available at https://github.com/yuqi657/video_language_model. |
| Open Datasets | Yes | For video captioning task, we use widely adopted benchmarks, MSRVTT (Xu et al. 2016), VATEX (Wang et al. 2019), MSVD (Chen and Dolan 2011). For video retrieval task, we choose MSRVTT (Xu et al. 2016) and LSMDC (Rohrbach et al. 2017). |
| Dataset Splits | No | The paper describes the datasets used and mentions evaluation on test sets, but it does not explicitly state the details of any validation dataset splits or specific validation methodologies for hyperparameter tuning. |
| Hardware Specification | Yes | Table 1, titled “Video captioning results of different fine-tuning methods on MSVD, VATEX and MSRVTT,” includes a column labeled “Mem” referring to “Memory Usage per GPU,” with values like “28.6,” “25.9,” and “21.9.” This indicates that experiments were run on GPUs and provides a memory specification. |
| Software Dependencies | No | The paper mentions using BLIP as a backbone model, but it does not specify version numbers for any software dependencies, programming languages, or libraries used in the implementation or experimentation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In most of our experiments, we choose BLIP (ViT-B/16) (Li et al. 2022) as our default base backbone model, where input frames are split into a sequence of 16 × 16 patches and then input to a 12-layer visual transformer. The text is encoded/decoded by a 12-layer transformer. The feature dimension of both text and video is 768. All models are initialized using BLIP (ViT-B/16), and we use the same setting (e.g. batch size) in all experiments for fair comparison. |
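
The quoted setup fixes the backbone hyperparameters (ViT-B/16 patches, 12 visual and 12 text layers, 768-dimensional features) but does not give them in code form. Below is a minimal, illustrative sketch that records those reported values as a configuration object; it is not taken from the authors' repository, and the class name, field names, and the 224 × 224 frame size in the usage example are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackboneConfig:
    # Values quoted from the paper's experiment setup; names are illustrative.
    visual_backbone: str = "BLIP (ViT-B/16)"
    patch_size: int = 16       # each frame is split into 16 x 16 pixel patches
    visual_layers: int = 12    # 12-layer visual transformer
    text_layers: int = 12      # 12-layer text encoder/decoder
    feature_dim: int = 768     # shared feature dimension for text and video

    def num_patches(self, frame_height: int, frame_width: int) -> int:
        """Number of patch tokens produced for a single frame."""
        return (frame_height // self.patch_size) * (frame_width // self.patch_size)


if __name__ == "__main__":
    cfg = BackboneConfig()
    # Assuming a standard 224 x 224 input frame: 14 * 14 = 196 patch tokens.
    print(cfg.num_patches(224, 224))
```

The configuration only mirrors what the paper states; settings the paper leaves unspecified (e.g. batch size, frame count, learning rate) are intentionally omitted rather than guessed.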