An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
AAAI 2022, pp. 3081-3089
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance. We conduct comprehensive experiments on the OK-VQA dataset (Marino et al. 2019). With a pre-trained captioning model (VinVL) (Zhang et al. 2021), PICa achieves an accuracy of 46.9% in a few-shot manner, an absolute improvement of 7.5 points when compared with supervised state of the art (Wu et al. 2021). When enhanced with predicted image tags, the performance can be further boosted to 48.0%. We also provide detailed ablation study and qualitative analysis to understand the effectiveness of PICa. |
| Researcher Affiliation | Industry | Microsoft Corporation {zhengyang, zhe.gan, jianfw, xiaowei.hu, yumaolu, zliu, lijuanw}@microsoft.com |
| Pseudocode | No | The paper describes its approach and mechanisms in prose and with diagrams, but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their methodology is openly available. |
| Open Datasets | Yes | We conduct comprehensive experiments on the OK-VQA dataset (Marino et al. 2019). The VQAv2 dataset (Goyal et al. 2017) annotates question-answer pairs based on the COCO image corpus (Lin et al. 2014). |
| Dataset Splits | Yes | We fine-tune the VinVL-base pre-trained checkpoint with the COCO 2014 training set to obtain the image captions on the OK-VQA test set, which contains images from COCO 2014 validation set. We follow Frozen (Tsimpoukelli et al. 2021), and report the accuracy on the validation set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models or CPU specifications. |
| Software Dependencies | No | The paper mentions software components and models used (e.g., "GPT-3 (Brown et al. 2020)", "VinVL (Zhang et al. 2021)", "CLIP model (ViT-B/16 variant) (Radford et al. 2021)", "public Microsoft Azure tagging API"), but does not specify version numbers for general software dependencies like Python, PyTorch, or the API itself. |
| Experiment Setup | Yes | Empirically, we show that converting image context into textual descriptions leads to a strong baseline for VQA. Figure 2 shows the inference-time interface of PICa, which approaches the VQA task by prompting GPT-3 with a constructed input prompt. The prompt is a word sequence that consists of context C (with a prompt head h and n in-context examples {xi, yi}n i=1) and VQA input x. We then concatenate C with the VQA input x shown in the green box to generate the prompt. GPT-3 takes the constructed prompt text as input, implicitly retrieving and reasoning over the knowledge stored in the language model, and predicts the answer y as an open-ended text generation task. n = 16 is roughly the max number of examples that GPT-3 can take, with a max input length of 2049. We reselect in-context examples of shorter lengths if any prompt exceeds the max input length limit, which rarely happens with n = 16. Given an inference-time example x, we use n × k in-context examples to generate k prompts. Among the k answer predictions, we select the one with the highest sum of log-probabilities $\sum_t \log p_{\mathrm{LM}}(y_t)$ as the final answer (Chen et al. 2021). |
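
The Experiment Setup row describes PICa's prompt construction and multi-query ensemble. The sketch below illustrates that procedure under stated assumptions: the exact prompt-head wording, the default `k`, and the `query_gpt3` wrapper (expected to return the generated answer together with its per-token log-probabilities) are placeholders for illustration, not the paper's released code.

```python
# Minimal sketch of PICa-style prompting and multi-query ensembling, assuming a
# caption/question/answer prompt format and a hypothetical GPT-3 completion wrapper.
from typing import List, Tuple

PROMPT_HEAD = "Please answer the question according to the above context.\n"  # assumed wording


def query_gpt3(prompt: str) -> Tuple[str, List[float]]:
    """Hypothetical wrapper around a GPT-3 completion endpoint.
    Expected to return (answer_text, per-token log-probabilities)."""
    raise NotImplementedError("plug in your GPT-3 / LLM completion call here")


def build_prompt(head: str,
                 examples: List[Tuple[str, str, str]],
                 caption: str,
                 question: str) -> str:
    """Concatenate the prompt head, n in-context examples, and the VQA input."""
    parts = [head]
    for ex_caption, ex_question, ex_answer in examples:
        parts.append(f"Context: {ex_caption}\n"
                     f"Question: {ex_question}\n"
                     f"Answer: {ex_answer}\n")
    # Inference-time example: the answer is left blank for GPT-3 to complete.
    parts.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)


def answer_with_ensemble(examples: List[Tuple[str, str, str]],
                         caption: str,
                         question: str,
                         n: int = 16,
                         k: int = 5) -> str:
    """Split n x k in-context examples into k prompts of n shots each and keep
    the prediction with the highest sum of token log-probabilities."""
    assert len(examples) >= n * k, "need n x k in-context examples"
    best_answer, best_score = "", float("-inf")
    for i in range(k):
        shots = examples[i * n:(i + 1) * n]
        prompt = build_prompt(PROMPT_HEAD, shots, caption, question)
        answer, token_logprobs = query_gpt3(prompt)   # hypothetical API call
        score = sum(token_logprobs)                   # sum_t log p_LM(y_t)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

With k = 1 this reduces to the single-prompt few-shot setting (n = 16 examples per prompt); larger k trades additional GPT-3 queries for the log-probability-based answer selection described in the row above.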