Tackling Vision Language Tasks through Learning Inner Monologues
Authors: Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | IMMO is evaluated on two popular tasks and achieves competitive performance with less training data compared with state-of-the-art models, while preserving interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. We evaluate IMMO on two vision-language reasoning tasks. Experiments show that IMMO achieves competitive results compared with GPT-4-based hybrid integration approaches, while using significantly less training data and providing greater interpretability than embedding alignment approaches. |
| Researcher Affiliation | Collaboration | Diji Yang (University of California, Santa Cruz), Kezhen Chen (Mineral), Jinmeng Rao (Mineral), Xiaoyuan Guo (Mineral), Yawen Zhang (Mineral), Jie Yang (Mineral), Yi Zhang (University of California, Santa Cruz); {dyang39,yiz}@ucsc.edu, {kezhenchen, jinmengrao, xiaoyuanguo, yawenz, yangjie}@mineral.ai |
| Pseudocode | Yes | Algorithm 1: IMMO Reinforcement Learning |
| Open Source Code | Yes | The code and data are released at https://github.com/ucscirkm/immo. |
| Open Datasets | Yes | We construct a new training corpus for supervised human-prior fine-tuning by utilizing the A-OKVQA (Schwenk et al. 2022) dataset... We conduct experiments on the ScienceQA (SQA) (Lu et al. 2022) dataset... SNLI-VE (Xie et al. 2018) is a widely-used VE task built on top of SNLI (Bowman et al. 2015) and Flickr30k (Plummer et al. 2015) image datasets. |
| Dataset Splits | Yes | We follow the official train/validation/test split for all our experiments. |
| Hardware Specification | Yes | For broader applicability, we chose a model that can be trained on a single NVIDIA A100-40G GPU or equivalent instead of a more powerful but larger model. |
| Software Dependencies | No | At the reinforcement learning stage, the training is mainly based on the Transformer Reinforcement Learning (TRL) library (Von Werra et al. 2020) to wrap the Hugging Face trainer (Wolf et al. 2020). While software components are named, specific version numbers for these or other key dependencies (e.g., Python, PyTorch) are not provided. |
| Experiment Setup | Yes | To ensure computational efficiency, we employed Low-rank adaptation (LoRA) (Hu et al. 2021) to train only 0.06% of the Vicuna-7b model, which corresponds to 5 million parameters. For simplicity, we used a fixed set of hyperparameters. Task-specific prompts for both LLM and VLM were designed manually, inspired by prompt templates used by You et al. (2023) and Liu et al. (2023). To examine the impact of Inner Monologue turns on performance, we conduct ablation tests on ScienceQA using both few-shot and trained approaches. Using the same set of hyperparameters, we evaluate turns ranging from 0 to 5... (An illustrative LoRA + TRL configuration sketch appears below the table.) |
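
The paper describes the training setup only at a high level: LoRA adapters covering roughly 0.06% (~5 million) of Vicuna-7b's parameters, with the RL stage built on TRL wrapping the Hugging Face trainer. The sketch below is a minimal, hypothetical reconstruction of such a setup, assuming the pre-0.12 TRL PPO API; the checkpoint name, LoRA rank, target modules, and PPO hyperparameters are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch (not the authors' code): LoRA adapters on Vicuna-7b via PEFT,
# with RL training wrapped by TRL's PPOTrainer around the Hugging Face stack.
# Concrete values are assumptions chosen to roughly match the stated budget of
# ~5M trainable parameters (about 0.06% of a 7B model).
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "lmsys/vicuna-7b-v1.5"  # assumed checkpoint identifier

lora_config = LoraConfig(
    r=8,                                   # rank 8 on q/v projections of a 7B LLaMA
    lora_alpha=16,                         # yields roughly 4-5M trainable parameters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,               # only the LoRA weights are trainable
)

ppo_config = PPOConfig(
    batch_size=8,                          # assumed; the paper reports a fixed set
    mini_batch_size=2,                     # of hyperparameters without listing them
    learning_rate=1.41e-5,
)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

# One PPO update in the Inner Monologue loop (tensors produced elsewhere):
#   query_tensors    - prompts containing the question and the VLM's replies
#   response_tensors - the LLM's generated monologue turn or final answer
#   rewards          - scalar scores comparing the final answer to the gold label
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Whether this reconstruction matches the released implementation can be verified against the repository linked in the Open Source Code row above.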