Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning
Authors: Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Zhiqing Sun, Dan Gutfreund, Chuang Gan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on a range of knowledge-based visual reasoning datasets. We found our VCTP enjoys several benefits, 1). it achieves better performance than the previous few-shot learning baselines; |
| Researcher Affiliation | Collaboration | Zhenfang Chen1*, Qinhong Zhou2, Yikang Shen1, Yining Hong3, Zhiqing Sun4, Dan Gutfreund1, Chuang Gan1,2; 1MIT-IBM Watson AI Lab; 2UMass Amherst; 3University of California, Los Angeles; 4Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Pipeline of the proposed VCTP |
| Open Source Code | Yes | Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git |
| Open Datasets | Yes | We evaluate our models on standard KB-VR benchmarks, OK-VQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022). |
| Dataset Splits | Yes | We compare our VCTP with baselines on the validation and test sets of the A-OKVQA dataset in Table 1. |
| Hardware Specification | No | The paper mentions 'our hardware configuration' but does not provide specific hardware details (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper lists several pre-trained models (e.g., Faster R-CNN, BLIP, OPT-66B, Llama-2-70B, CLIP (ViT-B/16)) that were used, but does not provide version numbers for general software dependencies or libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | We fix the number of the in-context examples in think module to 8 since it is the largest number we could efficiently run on our hardware configuration. Following (Yang et al. 2022), we prompt the LLM with in-context example selection and multi-query ensemble. For in-context examples, we select the examples most similar to the current image-question pair in training set with their clip features. For multi-query ensemble, we feed our models and the baselines 5 times and select the one with the highest log-probability as previous methods (Yang et al. 2022; Chen et al. 2021a) except the aligned models in Table 3, where we ensemble 14 times for baselines to make them have similar computation cost as ours. ... with the mIter in algorithm 1 to be 5. |
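
The Experiment Setup row describes two prompting steps: selecting the in-context examples most similar to the current image-question pair by CLIP features, and a multi-query ensemble that keeps the answer with the highest log-probability. The following is a minimal sketch of those two steps, not the authors' implementation. It assumes each image-question pair is already encoded as a single feature vector and that similarity is cosine similarity; the paper excerpt specifies neither. The function names `select_in_context_examples` and `ensemble_answer` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_in_context_examples(query_feature, train_features, k=8):
    """Return indices of the k training examples most similar to the query.

    Assumption: one precomputed feature vector per image-question pair.
    k=8 matches the number of in-context examples reported in the table.
    """
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: cosine(query_feature, train_features[i]),
        reverse=True,
    )
    return ranked[:k]


def ensemble_answer(candidates):
    """Pick the answer with the highest log-probability.

    `candidates` is a list of (answer, log_prob) pairs, one per query
    to the language model (5 queries in the reported setup).
    """
    return max(candidates, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    # Toy 2-D features standing in for real CLIP embeddings.
    train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(select_in_context_examples([1.0, 0.1], train, k=2))
    print(ensemble_answer([("cat", -2.3), ("dog", -0.7), ("bird", -1.5)]))
```

In a real run the feature vectors would come from a CLIP encoder and the log-probabilities from the language model's scoring of each generated answer. Both are replaced by toy values here.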