Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language
Authors: Yuqi Liu, Luhui Xu, Pengfei Xiong, Qin Jin
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out extensive experiments to compare our proposed token mixing method with other parameter-efficient transfer learning methods. Our token mixing method outperforms other methods on both understanding tasks and generation tasks. Besides, our method achieves new records on multiple video-language tasks. |
| Researcher Affiliation | Collaboration | Yuqi Liu¹,²*, Luhui Xu², Pengfei Xiong², Qin Jin¹ (¹ School of Information, Renmin University of China; ² Tencent) |
| Pseudocode | No | The paper provides mathematical equations and describes procedures in text, but it does not include any clearly labeled pseudocode blocks or algorithm sections. |
| Open Source Code | Yes | The code is available at https://github.com/yuqi657/video_language_model. |
| Open Datasets | Yes | For video captioning task, we use widely adopted benchmarks, MSRVTT (Xu et al. 2016), VATEX (Wang et al. 2019), MSVD (Chen and Dolan 2011). For video retrieval task, we choose MSRVTT (Xu et al. 2016) and LSMDC (Rohrbach et al. 2017). |
| Dataset Splits | No | The paper describes the datasets used and mentions evaluation on test sets, but it does not explicitly state the details of any validation dataset splits or specific validation methodologies for hyperparameter tuning. |
| Hardware Specification | Yes | Table 1, titled “Video captioning results of different fine-tuning methods on MSVD, VATEX and MSRVTT,” includes a column labeled “Mem” referring to “Memory Usage per GPU,” with values like “28.6,” “25.9,” and “21.9.” This indicates that experiments were run on GPUs and provides a memory specification. |
| Software Dependencies | No | The paper mentions using BLIP as a backbone model, but it does not specify version numbers for any software dependencies, programming languages, or libraries used in the implementation or experimentation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In most of our experiments, we choose BLIP (ViT-B/16) (Li et al. 2022) as our default base backbone model, where input frames are split into a sequence of 16 × 16 patches and then input to a 12-layer visual transformer. The text is encoded/decoded by a 12-layer transformer. The feature dimension of both text and video is 768. All models are initialized using BLIP (ViT-B/16), and we use the same setting (e.g. batch size) in all experiments for fair comparison. |
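
The quoted setup fixes the backbone hyperparameters (ViT-B/16 patches, 12 visual and 12 text layers, 768-dimensional features) but does not give them in code form. Below is a minimal, illustrative sketch that records those reported values as a configuration object; it is not taken from the authors' repository, and the class name, field names, and the 224 × 224 frame size in the usage example are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackboneConfig:
    # Values quoted from the paper's experiment setup; names are illustrative.
    visual_backbone: str = "BLIP (ViT-B/16)"
    patch_size: int = 16       # each frame is split into 16 x 16 pixel patches
    visual_layers: int = 12    # 12-layer visual transformer
    text_layers: int = 12      # 12-layer text encoder/decoder
    feature_dim: int = 768     # shared feature dimension for text and video

    def num_patches(self, frame_height: int, frame_width: int) -> int:
        """Number of patch tokens produced for a single frame."""
        return (frame_height // self.patch_size) * (frame_width // self.patch_size)


if __name__ == "__main__":
    cfg = BackboneConfig()
    # Assuming a standard 224 x 224 input frame: 14 * 14 = 196 patch tokens.
    print(cfg.num_patches(224, 224))
```

The configuration only mirrors what the paper states; settings the paper leaves unspecified (e.g. batch size, frame count, learning rate) are intentionally omitted rather than guessed.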