Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compare our approach with state-of-the-art approaches on five benchmarks, i.e., MSR-VTT [62], MSVD [7], VaTeX [57], YouCook2 [74], and VLEP [23]. The statistics of the datasets can be found in Table 1." and "We perform comprehensive ablation studies on our few-shot prompt including the impact of different video representation, number of shots and in-context selection."
Researcher Affiliation | Collaboration | "Zhenhailong Wang1, Manling Li1, Ruochen Xu2, Luowei Zhou2, Jie Lei3, Xudong Lin4, Shuohang Wang2, Ziyi Yang2, Chenguang Zhu2, Derek Hoiem1, Shih-Fu Chang4, Mohit Bansal3, Heng Ji1 (1UIUC, 2MSR, 3UNC, 4Columbia University); {wangz3,hengji}@illinois.edu"
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and processed data are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL."
Open Datasets | Yes | "We compare our approach with state-of-the-art approaches on five benchmarks, i.e., MSR-VTT [62], MSVD [7], VaTeX [57], YouCook2 [74], and VLEP [23]."
Dataset Splits | Yes | "Table 1: Statistics of datasets in our experiments (columns: Dataset, Task, Split Count as # train / # eval)" and "All the ablation results are evaluated on MSVD-QA validation set, and we report the mean and standard deviation of each setting on three sets of randomly sampled shots."
Hardware Specification | Yes | "The experiments are conducted on 2 NVIDIA V100 (16GB) GPUs."
Software Dependencies | No | The paper mentions software like 'CLIP-L/14', 'BLIP captioning checkpoint', 'InstructGPT', and 'semantic role labeling model from AllenNLP', but does not provide explicit version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'CUDA 11.1').
Experiment Setup | Yes | "Unless otherwise specified, we sample 4 frames for frame level and 8 frames for visual token level." and "If not otherwise specified, we use M=10 and N=5, which we consider as 10-shot training." and "it details the in-context example selection strategy." (A hedged sketch of this configuration follows the table.)
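
The sketch below illustrates the few-shot configuration quoted in the Experiment Setup row: uniform frame sampling (4 frames at the frame level, 8 frames at the visual-token level) and selection of N=5 in-context examples from M=10 support videos. This is not the authors' implementation; the helper names (sample_frame_indices, select_in_context_examples) and the random-selection baseline are assumptions for illustration only.

```python
import random
from typing import List, Sequence

# Defaults quoted from the paper's experiment setup.
FRAME_LEVEL_FRAMES = 4      # frames used for frame-level descriptions
VISUAL_TOKEN_FRAMES = 8     # frames used at the visual-token level
NUM_SHOTS_M = 10            # M: support-set videos ("10-shot training")
IN_CONTEXT_N = 5            # N: in-context examples placed in each prompt


def sample_frame_indices(num_frames_total: int, num_samples: int) -> List[int]:
    """Uniformly sample `num_samples` frame indices from a video (hypothetical helper)."""
    if num_frames_total <= num_samples:
        return list(range(num_frames_total))
    step = num_frames_total / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def select_in_context_examples(support_set: Sequence[dict],
                               n: int = IN_CONTEXT_N,
                               seed: int = 0) -> List[dict]:
    """Pick N of the M support videos as in-context examples.

    Random selection is shown only as the simplest baseline; the paper ablates
    the in-context example selection strategy separately.
    """
    rng = random.Random(seed)
    return rng.sample(list(support_set), k=min(n, len(support_set)))
```

As noted in the Dataset Splits row, ablations are reported as mean and standard deviation over three sets of randomly sampled shots, which in a sketch like this would correspond to repeating the selection with three different seeds.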