Tackling Vision Language Tasks through Learning Inner Monologues
Authors: Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | IMMO is evaluated on two popular tasks and achieves competitive performance with less training data compared with state-of-the-art models, while preserving interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. We evaluate IMMO on two vision-language reasoning tasks. Experiments show that IMMO achieves competitive results compared with GPT-4-based hybrid integration approaches, while using significantly less training data and providing greater interpretability than embedding alignment approaches. |
| Researcher Affiliation | Collaboration | Diji Yang (University of California, Santa Cruz), Kezhen Chen (Mineral), Jinmeng Rao (Mineral), Xiaoyuan Guo (Mineral), Yawen Zhang (Mineral), Jie Yang (Mineral), Yi Zhang (University of California, Santa Cruz); {dyang39,yiz}@ucsc.edu, {kezhenchen, jinmengrao, xiaoyuanguo, yawenz, yangjie}@mineral.ai |
| Pseudocode | Yes | Algorithm 1: IMMO Reinforcement Learning |
| Open Source Code | Yes | The code and data are released at https://github.com/ucscirkm/immo. |
| Open Datasets | Yes | We construct a new training corpus for supervised human-prior fine-tuning by utilizing the A-OKVQA (Schwenk et al. 2022) dataset... We conduct experiments on the ScienceQA (SQA) (Lu et al. 2022) dataset... SNLI-VE (Xie et al. 2018) is a widely-used VE task built on top of SNLI (Bowman et al. 2015) and Flickr30k (Plummer et al. 2015) image datasets. |
| Dataset Splits | Yes | We follow the official train/validation/test split for all our experiments. |
| Hardware Specification | Yes | For broader applicability, we chose a model that can be trained on a single NVIDIA A100-40G GPU or equivalent instead of a more powerful but larger model. |
| Software Dependencies | No | At the reinforcement learning stage, the training is mainly based on the Transformer Reinforcement Learning (TRL) library (Von Werra et al. 2020) to wrap the Hugging Face trainer (Wolf et al. 2020). While software components are named, specific version numbers for these or other key dependencies (e.g., Python, PyTorch) are not provided. |
| Experiment Setup | Yes | To ensure computational efficiency, we employed Low-rank adaptation (LoRA) (Hu et al. 2021) to train only 0.06% of the Vicuna-7b model, which corresponds to 5 million parameters. For simplicity, we used a fixed set of hyperparameters. Task-specific prompts for both LLM and VLM were designed manually, inspired by prompt templates used by You et al. (2023) and Liu et al. (2023). To examine the impact of Inner Monologue turns on performance, we conduct ablation tests on ScienceQA using both few-shot and trained approaches. Using the same set of hyperparameters, we evaluate turns ranging from 0 to 5... (An illustrative LoRA + TRL configuration sketch appears below the table.) |
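
The paper describes the training setup only at a high level: LoRA adapters covering roughly 0.06% (~5 million) of Vicuna-7b's parameters, with the RL stage built on TRL wrapping the Hugging Face trainer. The sketch below is a minimal, hypothetical reconstruction of such a setup, assuming the pre-0.12 TRL PPO API; the checkpoint name, LoRA rank, target modules, and PPO hyperparameters are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch (not the authors' code): LoRA adapters on Vicuna-7b via PEFT,
# with RL training wrapped by TRL's PPOTrainer around the Hugging Face stack.
# Concrete values are assumptions chosen to roughly match the stated budget of
# ~5M trainable parameters (about 0.06% of a 7B model).
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "lmsys/vicuna-7b-v1.5"  # assumed checkpoint identifier

lora_config = LoraConfig(
    r=8,                                   # rank 8 on q/v projections of a 7B LLaMA
    lora_alpha=16,                         # yields roughly 4-5M trainable parameters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,               # only the LoRA weights are trainable
)

ppo_config = PPOConfig(
    batch_size=8,                          # assumed; the paper reports a fixed set
    mini_batch_size=2,                     # of hyperparameters without listing them
    learning_rate=1.41e-5,
)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

# One PPO update in the Inner Monologue loop (tensors produced elsewhere):
#   query_tensors    - prompts containing the question and the VLM's replies
#   response_tensors - the LLM's generated monologue turn or final answer
#   rewards          - scalar scores comparing the final answer to the gold label
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Whether this reconstruction matches the released implementation can be verified against the repository linked in the Open Source Code row above.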