Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our approach with state-of-the-art approaches on five benchmarks, i.e., MSR-VTT [62], MSVD [7], VATEX [57], YouCook2 [74], and VLEP [23]. The statistics of the datasets can be found in Table 1. and We perform comprehensive ablation studies on our few-shot prompt, including the impact of different video representations, number of shots, and in-context selection. |
| Researcher Affiliation | Collaboration | Zhenhailong Wang1, Manling Li1, Ruochen Xu2, Luowei Zhou2, Jie Lei3, Xudong Lin4, Shuohang Wang2, Ziyi Yang2, Chenguang Zhu2, Derek Hoiem1, Shih-Fu Chang4, Mohit Bansal3, Heng Ji1; 1UIUC, 2MSR, 3UNC, 4Columbia University; {wangz3,hengji}@illinois.edu |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and processed data are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL. |
| Open Datasets | Yes | We compare our approach with state-of-the-art approaches on five benchmarks, i.e., MSR-VTT [62], MSVD [7], VATEX [57], YouCook2 [74], and VLEP [23]. |
| Dataset Splits | Yes | Table 1: Statistics of datasets in our experiments (columns: Dataset, Task, Split, Count as # train / # eval). and All the ablation results are evaluated on the MSVD-QA validation set, and we report the mean and standard deviation of each setting on three sets of randomly sampled shots. |
| Hardware Specification | Yes | The experiments are conducted on 2 NVIDIA V100 (16GB) GPUs. |
| Software Dependencies | No | The paper mentions software like 'CLIP-L/14', 'BLIP captioning checkpoint', 'InstructGPT', and 'semantic role labeling model from AllenNLP', but does not provide explicit version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'CUDA 11.1'). |
| Experiment Setup | Yes | Unless otherwise specified, we sample 4 frames for frame level and 8 frames for visual token level. and If not otherwise specified, we use M=10 and N=5, which we consider as 10-shot training. and The paper details its in-context example selection strategy. (A frame-sampling sketch follows this table.) |
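
The setup row quotes sampling 4 frames at the frame level and 8 frames at the visual-token level. As a minimal sketch of how such per-video sampling is commonly computed, here is a uniform-sampling helper; the function name and the evenly spaced, segment-centered strategy are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int) -> np.ndarray:
    """Pick `num_samples` frame indices evenly spaced across a video.

    Hypothetical helper illustrating the sampling counts quoted above
    (4 frames at the frame level, 8 at the visual-token level); the
    exact strategy used by VidIL may differ.
    """
    # Center each sample inside its segment to avoid clustering at the ends.
    segment = total_frames / num_samples
    return np.array([int(segment * (i + 0.5)) for i in range(num_samples)])

# Example: a 120-frame clip sampled at the two granularities from the setup.
print(uniform_frame_indices(120, 4))  # frame level -> [ 15  45  75 105]
print(uniform_frame_indices(120, 8))  # visual-token level -> [  7  22  37  52  67  82  97 112]
```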