Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our approach with state-of-the-art approaches on five benchmarks, i.e., MSR-VTT [62], MSVD [7], VaTeX [57], YouCook2 [74], and VLEP [23]. The statistics of the datasets can be found in Table 1. and We perform comprehensive ablation studies on our few-shot prompt including the impact of different video representation, number of shots and in-context selection. |
| Researcher Affiliation | Collaboration | Zhenhailong Wang1, Manling Li1, Ruochen Xu2, Luowei Zhou2, Jie Lei3, Xudong Lin4, Shuohang Wang2, Ziyi Yang2, Chenguang Zhu2, Derek Hoiem1, Shih-Fu Chang4, Mohit Bansal3, Heng Ji1 1UIUC 2MSR 3UNC 4Columbia University EMAIL |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and processed data are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL. |
| Open Datasets | Yes | We compare our approach with state-of-the-art approaches on five benchmarks, i.e., MSR-VTT [62], MSVD [7], VaTeX [57], YouCook2 [74], and VLEP [23]. |
| Dataset Splits | Yes | Table 1: Statistics of datasets in our experiments Dataset Task Split Count # train / # eval and All the ablation results are evaluated on MSVD_QA validation set, and we report the mean and standard deviation of each setting on three sets of randomly sampled shots. |
| Hardware Specification | Yes | The experiments are conducted on 2 NVIDIA V100 (16GB) GPUs. |
| Software Dependencies | No | The paper mentions software like 'CLIP-L/14', 'BLIP captioning checkpoint', 'InstructGPT', and 'semantic role labeling model from AllenNLP', but does not provide explicit version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'CUDA 11.1'). |
| Experiment Setup | Yes | Unless otherwise specified, we sample 4 frames for frame level and 8 frames for visual token level. and If not otherwise specified, we use M=10 and N=5, which we consider as 10-shot training. and it details the in-context example selection strategy. |