AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

Authors: Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David Ross, Cordelia Schmid, Alireza Fathi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek [7] and OK-VQA [26].
Researcher Affiliation | Collaboration | Ziniu Hu (1,2), Ahmet Iscen (2), Chen Sun (2), Kai-Wei Chang (1), Yizhou Sun (1), David A. Ross (2), Cordelia Schmid (2), Alireza Fathi (2); 1: University of California, Los Angeles; 2: Google Research
Pseudocode | Yes | Algorithm 1: Planner P(state, G, E, M) and Algorithm 2: AVIS Decision-Making Workflow (a hedged sketch of this loop follows the table)
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | We evaluate AVIS on two visual question answering datasets: i) OK-VQA [26], which requires common-sense knowledge not observed in the given image; and ii) Infoseek_wikidata [7], which further necessitates more fine-grained information that cannot be covered by common-sense knowledge.
Dataset Splits | No | The paper mentions an "unseen entity split" and an "unseen question split" for Infoseek and uses OK-VQA, but it does not give counts or percentages for training, validation, or test splits, describe the splitting methodology, or reference predefined splits in enough detail for reproducibility.
Hardware Specification | No | The paper mentions using specific models such as PaLM 540B and PaLI 17B, but does not specify the underlying hardware (e.g., GPU models, CPU types, or TPU versions) used to run the experiments.
Software Dependencies | No | The paper mentions specific models such as PaLI 17B and PaLM 540B, but does not provide a reproducible description of ancillary software, such as programming languages, libraries, or solvers, with specific version numbers (e.g., "Python 3.8, PyTorch 1.9").
Experiment Setup | Yes | We use the frozen PaLM 540B language model [9] for both the planner and the reasoner, with deterministic generation ensured by setting the temperature parameter to zero. We use 10 examples as in-context prompts for each dataset, and report the VQA accuracy [11] as the evaluation metric. (The decoding configuration is sketched after the table.)
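
The planner/reasoner/working-memory design summarized in the Research Type and Pseudocode rows can be pictured as a simple tool-use loop. The sketch below is an illustration only: `call_llm` and `tools` are hypothetical interfaces, and the real planner additionally consults the transition graph, user-study examples, and memory named in Algorithm 1 (the G, E, M arguments), which are omitted here. It is not the authors' released implementation.

```python
# Hedged sketch of the AVIS-style decision loop described in the paper:
# an LLM planner picks the next tool, an LLM reasoner distills the tool
# output, and a working memory accumulates the findings.
def avis_answer(question, image, call_llm, tools, max_steps=10):
    """Iteratively choose a tool, run it, and add the distilled result to memory."""
    memory = [f"Question: {question}"]  # working memory of findings so far

    for _ in range(max_steps):
        # Planner: decide the next action (a tool plus its query, or ANSWER).
        plan = call_llm(
            "Given the findings so far, pick the next tool and query, "
            "or reply 'ANSWER: <answer>' if the question can be answered.\n"
            + "\n".join(memory)
            + "\nAvailable tools: " + ", ".join(tools)
        )
        if plan.startswith("ANSWER"):
            return plan.removeprefix("ANSWER:").strip()

        # Execute the chosen tool on the image and/or query.
        tool_name, _, tool_query = plan.partition(":")
        output = tools[tool_name.strip()](image, tool_query.strip())

        # Reasoner: keep only the information relevant to the question.
        finding = call_llm(
            f"Question: {question}\nTool output: {output}\n"
            "Summarize only the facts useful for answering."
        )
        memory.append(finding)  # update working memory

    # Fall back to answering from whatever has been collected.
    return call_llm("Answer the question from these findings:\n" + "\n".join(memory))
```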
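
The Experiment Setup row reports deterministic decoding (temperature set to zero) with 10 in-context examples per dataset. A minimal sketch of that decoding configuration is given below, with a placeholder `generate` function standing in for the frozen PaLM 540B model; the demonstration strings and the interface are assumptions for illustration, not the paper's code.

```python
# Placeholder few-shot demonstrations; the paper uses 10 per dataset.
IN_CONTEXT_EXAMPLES = [
    "Q: <example question 1>\nA: <example answer 1>",
    "Q: <example question 2>\nA: <example answer 2>",
]

def build_prompt(question: str, findings: str) -> str:
    """Prepend the in-context demonstrations to the current query."""
    shots = "\n\n".join(IN_CONTEXT_EXAMPLES)
    return f"{shots}\n\nFindings: {findings}\nQ: {question}\nA:"

def answer_deterministically(generate, question: str, findings: str) -> str:
    # temperature=0.0 gives greedy, reproducible decoding, as reported in the paper.
    return generate(build_prompt(question, findings), temperature=0.0)
```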