Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning

Authors: Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Zhiqing Sun, Dan Gutfreund, Chuang Gan

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on a range of knowledge-based visual reasoning datasets. We found our VCTP enjoys several benefits: 1) it achieves better performance than the previous few-shot learning baselines;
Researcher Affiliation | Collaboration | Zhenfang Chen1*, Qinhong Zhou2, Yikang Shen1, Yining Hong3, Zhiqing Sun4, Dan Gutfreund1, Chuang Gan1,2 (1MIT-IBM Watson AI Lab, 2UMass Amherst, 3University of California, Los Angeles, 4Carnegie Mellon University)
Pseudocode | Yes | Algorithm 1: Pipeline of the proposed VCTP
Open Source Code | Yes | Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git
Open Datasets | Yes | We evaluate our models on standard KB-VR benchmarks, OK-VQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022).
Dataset Splits | Yes | We compare our VCTP with baselines on the validation and test sets of the A-OKVQA dataset in Table 1.
Hardware Specification | No | The paper mentions 'our hardware configuration' but does not provide specific hardware details (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper lists several pre-trained models (e.g., Faster R-CNN, BLIP, OPT-66B, Llama-2-70B, CLIP (ViT-B/16)) that were used, but does not provide version numbers for general software dependencies or libraries (e.g., Python, PyTorch versions).
Experiment Setup | Yes | We fix the number of in-context examples in the think module to 8 since it is the largest number we could efficiently run on our hardware configuration. Following (Yang et al. 2022), we prompt the LLM with in-context example selection and multi-query ensemble. For in-context examples, we select the examples most similar to the current image-question pair in the training set with their CLIP features. For multi-query ensemble, we feed our models and the baselines 5 times and select the one with the highest log-probability, as in previous methods (Yang et al. 2022; Chen et al. 2021a), except for the aligned models in Table 3, where we ensemble 14 times for the baselines to make their computation cost similar to ours. ... with mIter in Algorithm 1 set to 5.
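The Experiment Setup row describes two prompting components: in-context example selection by CLIP-feature similarity and a multi-query ensemble that keeps the answer with the highest log-probability. The sketch below is a minimal, hypothetical illustration of those two steps, assuming precomputed CLIP features and an LLM query function that returns a completion together with its log-probability; names such as `select_in_context_examples` and `query_llm` are illustrative placeholders, not part of the released VisualCoT code.

```python
# Hypothetical sketch (not the authors' released code) of the prompting setup
# quoted above: pick the 8 most CLIP-similar training examples as in-context
# demonstrations, then run 5 LLM queries and keep the answer whose completion
# has the highest log-probability.
import numpy as np

def select_in_context_examples(query_feat, train_feats, k=8):
    """Indices of the k training pairs whose CLIP features have the highest
    cosine similarity to the current image-question pair."""
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    return np.argsort(-(t @ q))[:k]

def answer_with_ensemble(prompts, query_llm, n_queries=5):
    """Multi-query ensemble: query the LLM once per prompt variant and return
    the answer whose completion scored the highest total log-probability.
    `query_llm` is an assumed callable returning (answer_text, log_prob)."""
    best_answer, best_logprob = None, float("-inf")
    for prompt in prompts[:n_queries]:
        answer, logprob = query_llm(prompt)
        if logprob > best_logprob:
            best_answer, best_logprob = answer, logprob
    return best_answer
```

Under this reading, the aligned comparison in Table 3 would correspond to calling the same ensemble routine with n_queries=14 for the baselines so that their computation cost roughly matches the proposed pipeline.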