An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
AAAI 2022, pp. 3081-3089
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance. We conduct comprehensive experiments on the OK-VQA dataset (Marino et al. 2019). With a pre-trained captioning model (VinVL) (Zhang et al. 2021), PICa achieves an accuracy of 46.9% in a few-shot manner, an absolute improvement of 7.5 points when compared with supervised state of the art (Wu et al. 2021). When enhanced with predicted image tags, the performance can be further boosted to 48.0%. We also provide detailed ablation study and qualitative analysis to understand the effectiveness of PICa. |
| Researcher Affiliation | Industry | Microsoft Corporation {zhengyang, zhe.gan, jianfw, xiaowei.hu, yumaolu, zliu, lijuanw}@microsoft.com |
| Pseudocode | No | The paper describes its approach and mechanisms in prose and with diagrams, but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their methodology is openly available. |
| Open Datasets | Yes | We conduct comprehensive experiments on the OK-VQA dataset (Marino et al. 2019). The VQAv2 dataset (Goyal et al. 2017) annotates question-answer pairs based on the COCO image corpus (Lin et al. 2014). |
| Dataset Splits | Yes | We fine-tune the VinVL-base pre-trained checkpoint with the COCO 2014 training set to obtain the image captions on the OK-VQA test set, which contains images from COCO 2014 validation set. We follow Frozen (Tsimpoukelli et al. 2021), and report the accuracy on the validation set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models or CPU specifications. |
| Software Dependencies | No | The paper mentions software components and models used (e.g., "GPT-3 (Brown et al. 2020)", "VinVL (Zhang et al. 2021)", "CLIP model (ViT-B/16 variant) (Radford et al. 2021)", "public Microsoft Azure tagging API"), but does not specify version numbers for general software dependencies like Python, PyTorch, or the API itself. |
| Experiment Setup | Yes | Empirically, we show that converting image context into textual descriptions leads to a strong baseline for VQA. Figure 2 shows the inference-time interface of PICa, which approaches the VQA task by prompting GPT-3 with a constructed input prompt. The prompt is a word sequence that consists of context C (with a prompt head h and n in-context examples {xi, yi}n i=1) and VQA input x. We then concatenate C with the VQA input x shown in the green box to generate the prompt. GPT-3 takes the constructed prompt text as input, implicitly retrieving and reasoning over the knowledge stored in the language model, and predicts the answer y as an open-ended text generation task. n = 16 is roughly the max number of examples that GPT-3 can take, with a max input length of 2049. We reselect in-context examples of shorter lengths if any prompt exceeds the max input length limit, which rarely happens with n = 16. Given an inference-time example x, we use n × k in-context examples to generate k prompts. Among the k answer predictions, we select the one with the highest sum of log-probabilities $\sum_t \log p_{\mathrm{LM}}(y_t)$ as the final answer (Chen et al. 2021). |
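
The Experiment Setup row describes PICa's prompt construction and multi-query ensemble. The sketch below illustrates that procedure under stated assumptions: the exact prompt-head wording, the default `k`, and the `query_gpt3` wrapper (expected to return the generated answer together with its per-token log-probabilities) are placeholders for illustration, not the paper's released code.

```python
# Minimal sketch of PICa-style prompting and multi-query ensembling, assuming a
# caption/question/answer prompt format and a hypothetical GPT-3 completion wrapper.
from typing import List, Tuple

PROMPT_HEAD = "Please answer the question according to the above context.\n"  # assumed wording


def query_gpt3(prompt: str) -> Tuple[str, List[float]]:
    """Hypothetical wrapper around a GPT-3 completion endpoint.
    Expected to return (answer_text, per-token log-probabilities)."""
    raise NotImplementedError("plug in your GPT-3 / LLM completion call here")


def build_prompt(head: str,
                 examples: List[Tuple[str, str, str]],
                 caption: str,
                 question: str) -> str:
    """Concatenate the prompt head, n in-context examples, and the VQA input."""
    parts = [head]
    for ex_caption, ex_question, ex_answer in examples:
        parts.append(f"Context: {ex_caption}\n"
                     f"Question: {ex_question}\n"
                     f"Answer: {ex_answer}\n")
    # Inference-time example: the answer is left blank for GPT-3 to complete.
    parts.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)


def answer_with_ensemble(examples: List[Tuple[str, str, str]],
                         caption: str,
                         question: str,
                         n: int = 16,
                         k: int = 5) -> str:
    """Split n x k in-context examples into k prompts of n shots each and keep
    the prediction with the highest sum of token log-probabilities."""
    assert len(examples) >= n * k, "need n x k in-context examples"
    best_answer, best_score = "", float("-inf")
    for i in range(k):
        shots = examples[i * n:(i + 1) * n]
        prompt = build_prompt(PROMPT_HEAD, shots, caption, question)
        answer, token_logprobs = query_gpt3(prompt)   # hypothetical API call
        score = sum(token_logprobs)                   # sum_t log p_LM(y_t)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

With k = 1 this reduces to the single-prompt few-shot setting (n = 16 examples per prompt); larger k trades additional GPT-3 queries for the log-probability-based answer selection described in the row above.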