Large Language Models are Visual Reasoning Coordinators
Authors: Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities. |
| Researcher Affiliation | Collaboration | S-Lab, Nanyang Technological University; University of California, Berkeley; Microsoft Research, Redmond. {lchen025, libo0013, ziwei.liu}@ntu.edu.sg |
| Pseudocode | No | The paper describes its methods and processes but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | https://github.com/cliangyu/Cola |
| Open Datasets | Yes | Our experiments are conducted on a challenging suite of three diverse visual reasoning tasks, including outside knowledge VQA, visual entailment, and visual spatial reasoning. For each task, we select the following dataset respectively. Visual Question Answering v2 [27] (VQA v2)... Augmented Outside Knowledge VQA [89] (A-OKVQA)... Outside Knowledge VQA [63] (OK-VQA)... e-SNLI-VE [21]... Visual Spatial Reasoning [56] (VSR)... GQA [35]... Compositional Language and Elementary Visual Reasoning [40] (CLEVR)... |
| Dataset Splits | Yes | The dataset contains 700k questions in the training set and 150k in the validation set. [...] In A-OKVQA, we report both val/test accuracies; val accuracy in VQA v2, OK-VQA, e-SNLI-VE, GQA, and CLEVR; and test (zero-shot split) accuracy in VSR. |
| Hardware Specification | Yes | We finetune and evaluate the models on NVIDIA V100 or A100 GPUs. |
| Software Dependencies | No | The paper mentions 'Hugging Face Transformers library' and 'all-mpnet-base-v2 model', but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use an Adafactor optimizer [92] at the learning rate of 1e-4 for all Cola-FT experiments. The batch size is by default set to 16, though we find Cola-FT insensitive to batch size. (See the configuration sketch below the table.) |
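
The reported setup maps onto the Adafactor implementation in the Hugging Face Transformers library, which the paper lists as a dependency. Below is a minimal sketch of that configuration, not the authors' released code: the FLAN-T5 checkpoint name stands in for the coordinator model and is an illustrative assumption, as are the variable names.

```python
from transformers import Adafactor, AutoModelForSeq2SeqLM

# Assumed stand-in for the coordinator LM; the exact checkpoint is illustrative.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Fixed-learning-rate Adafactor at 1e-4, as reported for Cola-FT.
# Disabling scale_parameter/relative_step/warmup_init makes Adafactor use the
# explicit lr instead of its internal relative-step schedule.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

# Default batch size of 16; the paper notes Cola-FT is insensitive to this value.
BATCH_SIZE = 16
```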