Large Language Models are Visual Reasoning Coordinators

Authors: Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero- and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities. |
| Researcher Affiliation | Collaboration | S-Lab, Nanyang Technological University; University of California, Berkeley; Microsoft Research, Redmond. {lchen025, libo0013, ziwei.liu}@ntu.edu.sg |
| Pseudocode | No | The paper describes its methods and processes but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | https://github.com/cliangyu/Cola |
| Open Datasets | Yes | Our experiments are conducted on a challenging suite of three diverse visual reasoning tasks, including outside knowledge VQA, visual entailment, and visual spatial reasoning. For each task, we select the following dataset respectively: Visual Question Answering v2 [27] (VQA v2)... Augmented Outside Knowledge VQA [89] (A-OKVQA)... Outside Knowledge VQA [63] (OK-VQA)... e-SNLI-VE [21]... Visual Spatial Reasoning [56] (VSR)... GQA [35]... Compositional Language and Elementary Visual Reasoning [40] (CLEVR)... |
| Dataset Splits | Yes | The dataset contains 700k questions in the training set and 150k in the validation set. [...] In A-OKVQA, we report both val/test accuracies, and val accuracy in VQA v2, OK-VQA, e-SNLI-VE, GQA, and CLEVR; test (zero-shot split) accuracy in VSR. |
| Hardware Specification | Yes | We finetune and evaluate the models on NVIDIA V100 or A100 GPUs. |
| Software Dependencies | No | The paper mentions the 'Hugging Face Transformers library' and the 'all-mpnet-base-v2' model, but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use an Adafactor optimizer [92] at a learning rate of 1e-4 for all Cola-FT experiments. The batch size is by default set to 16, though we find Cola-FT insensitive to batch size. (See the configuration sketch below the table.) |
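The Experiment Setup row reports only the optimizer, learning rate, and batch size. Below is a minimal sketch of that configuration, assuming a Hugging Face seq2seq coordinator LLM; the model name, `training_step` helper, and data handling are illustrative placeholders, not the authors' released code (see https://github.com/cliangyu/Cola for the official implementation).

```python
# Minimal sketch of the reported Cola-FT optimization settings.
# Assumptions (not taken from the paper's code): the coordinator is a
# Hugging Face seq2seq LM, and "google/flan-t5-xl" is only a placeholder name.
from transformers import Adafactor, AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"  # placeholder coordinator LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Adafactor with a fixed learning rate of 1e-4, as reported for Cola-FT.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    scale_parameter=False,  # disable Adafactor's internal LR scaling
    relative_step=False,    # use the fixed lr above instead of a relative step size
    warmup_init=False,
)

batch_size = 16  # paper default; reported to be insensitive to this value

def training_step(batch_prompts, batch_targets):
    """One supervised fine-tuning step on a batch of prompt/answer strings (illustrative)."""
    inputs = tokenizer(batch_prompts, padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(batch_targets, padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Note that using a fixed learning rate with Adafactor requires disabling its relative-step schedule, which is why `scale_parameter`, `relative_step`, and `warmup_init` are set to `False` in this sketch.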