Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning
Authors: Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Zhiqing Sun, Dan Gutfreund, Chuang Gan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on a range of knowledge-based visual reasoning datasets. We found our VCTP enjoys several benefits: (1) it achieves better performance than the previous few-shot learning baselines; |
| Researcher Affiliation | Collaboration | Zhenfang Chen¹*, Qinhong Zhou², Yikang Shen¹, Yining Hong³, Zhiqing Sun⁴, Dan Gutfreund¹, Chuang Gan¹,² (¹MIT-IBM Watson AI Lab, ²UMass Amherst, ³University of California, Los Angeles, ⁴Carnegie Mellon University) |
| Pseudocode | Yes | Algorithm 1: Pipeline of the proposed VCTP |
| Open Source Code | Yes | Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git |
| Open Datasets | Yes | We evaluate our models on standard KB-VR benchmarks, OK-VQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022). |
| Dataset Splits | Yes | We compare our VCTP with baselines on the validation and test sets of the A-OKVQA dataset in Table 1. |
| Hardware Specification | No | The paper mentions 'our hardware configuration' but does not provide specific hardware details (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper lists several pre-trained models (e.g., Faster R-CNN, BLIP, OPT-66B, Llama-2-70B, CLIP (ViT-B/16)) that were used, but does not provide version numbers for general software dependencies or libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | We fix the number of in-context examples in the think module to 8, since it is the largest number we could efficiently run on our hardware configuration. Following (Yang et al. 2022), we prompt the LLM with in-context example selection and multi-query ensemble. For in-context examples, we select the examples in the training set most similar to the current image-question pair based on their CLIP features. For multi-query ensemble, we query our models and the baselines 5 times and select the response with the highest log-probability, as in previous methods (Yang et al. 2022; Chen et al. 2021a), except for the aligned models in Table 3, where we ensemble 14 times for the baselines so that they have similar computation cost to ours. ... with the mIter in Algorithm 1 set to 5. |
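Below is a minimal sketch of the two prompting ingredients quoted in the Experiment Setup row, assuming precomputed CLIP features and a generic `llm_query` callable that returns an answer together with its log-probability; the function names and interfaces here are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def select_in_context_examples(query_feat, train_feats, k=8):
    """Return the indices of the k training examples whose CLIP features are
    most similar (cosine similarity) to the current image-question pair."""
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    return np.argsort(-(t @ q))[:k]

def multi_query_ensemble(build_prompt, llm_query, n_queries=5):
    """Prompt the LLM n_queries times and keep the answer with the highest
    log-probability, mirroring the quoted multi-query ensemble."""
    best_answer, best_logprob = None, float("-inf")
    for i in range(n_queries):
        # build_prompt may vary the in-context examples per query;
        # llm_query is a hypothetical wrapper around the LLM call.
        answer, logprob = llm_query(build_prompt(i))
        if logprob > best_logprob:
            best_answer, best_logprob = answer, logprob
    return best_answer
```

Note that the quoted mIter = 5 controls the number of iterations of Algorithm 1 itself and is separate from the 5-query ensemble sketched here.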