Large Language Models are Visual Reasoning Coordinators

Authors: Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero- and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities. |
| Researcher Affiliation | Collaboration | S-Lab, Nanyang Technological University; University of California, Berkeley; Microsoft Research, Redmond. {lchen025, libo0013, ziwei.liu}@ntu.edu.sg |
| Pseudocode | No | The paper describes its methods and processes but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | https://github.com/cliangyu/Cola |
| Open Datasets | Yes | Our experiments are conducted on a challenging suite of three diverse visual reasoning tasks, including outside knowledge VQA, visual entailment, and visual spatial reasoning. For each task, we select the following dataset respectively: Visual Question Answering v2 [27] (VQA v2)... Augmented Outside Knowledge VQA [89] (A-OKVQA)... Outside Knowledge VQA [63] (OK-VQA)... e-SNLI-VE [21]... Visual Spatial Reasoning [56] (VSR)... GQA [35]... Compositional Language and Elementary Visual Reasoning [40] (CLEVR)... |
| Dataset Splits | Yes | The dataset contains 700k questions in the training set and 150k in the validation set. [...] In A-OKVQA, we report both val/test accuracies, and val accuracy in VQA v2, OK-VQA, e-SNLI-VE, GQA, and CLEVR; test (zero-shot split) accuracy in VSR. |
| Hardware Specification | Yes | We finetune and evaluate the models on NVIDIA V100 or A100 GPUs. |
| Software Dependencies | No | The paper mentions the 'Hugging Face Transformers library' and the 'all-mpnet-base-v2' model, but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use an Adafactor optimizer [92] at a learning rate of 1e-4 for all Cola-FT experiments. The batch size is by default set to 16, though we find Cola-FT insensitive to batch size. (See the configuration sketch below the table.) |
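The Experiment Setup row reports only the optimizer, learning rate, and batch size. Below is a minimal sketch of that configuration, assuming a Hugging Face seq2seq coordinator LLM; the model name, `training_step` helper, and data handling are illustrative placeholders, not the authors' released code (see https://github.com/cliangyu/Cola for the official implementation).

```python
# Minimal sketch of the reported Cola-FT optimization settings.
# Assumptions (not taken from the paper's code): the coordinator is a
# Hugging Face seq2seq LM, and "google/flan-t5-xl" is only a placeholder name.
from transformers import Adafactor, AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"  # placeholder coordinator LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Adafactor with a fixed learning rate of 1e-4, as reported for Cola-FT.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    scale_parameter=False,  # disable Adafactor's internal LR scaling
    relative_step=False,    # use the fixed lr above instead of a relative step size
    warmup_init=False,
)

batch_size = 16  # paper default; reported to be insensitive to this value

def training_step(batch_prompts, batch_targets):
    """One supervised fine-tuning step on a batch of prompt/answer strings (illustrative)."""
    inputs = tokenizer(batch_prompts, padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(batch_targets, padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Note that using a fixed learning rate with Adafactor requires disabling its relative-step schedule, which is why `scale_parameter`, `relative_step`, and `warmup_init` are set to `False` in this sketch.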