Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
Authors: Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, Bowen Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two representative OKVQA and A-OKVQA datasets demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines. |
| Researcher Affiliation | Academia | ¹Department of Electronic Engineering, Tsinghua University; ²Shanghai Artificial Intelligence Laboratory |
| Pseudocode | No | The paper describes the methodology in narrative text and figures (e.g., Figure 2: The architecture of ReAuSE) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The code will be available at https://github.com/xinwei666/ReAuSE |
| Open Datasets | Yes | We focus on the knowledge-based VQA benchmarks, OKVQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022). Previous work provided two retrieval corpora, GS112K (Luo et al. 2021) and Wiki21M (Karpukhin et al. 2020), for the OKVQA dataset. Additionally, we introduce a new information-seeking dataset, InfoSeek (Chen et al. 2023d), to evaluate the model's retrieval performance. |
| Dataset Splits | Yes | We strictly follow the settings of the original papers, using the corresponding metrics for each dataset. For the OKVQA dataset and the direct answer setting of the A-OKVQA dataset, we use the VQA score to evaluate the model's performance. For the multi-choice setting of the A-OKVQA dataset, we use accuracy for evaluation. |
| Hardware Specification | Yes | Each training stage is performed on four NVIDIA A6000 48G GPUs and completed within three hours. |
| Software Dependencies | Yes | Our model is implemented in PyTorch, utilizing version 0.3.0 of the PEFT library, which supports efficient switching between two LoRA adapters during inference. |
| Experiment Setup | Yes | In our main experiments, we utilize MiniGPT4v2-7B as the base model, which employs ViT-L/14 from pretrained CLIP as the image encoder and LLaMA-v2-7B (Touvron et al. 2023) as the text encoder. We freeze all parameters of the MLLM, allowing updates only to the LoRA parameters. We use the same MLLM in all three stages but apply two sets of LoRA parameters to optimize the model respectively: one for retrieval and alignment, and the other for answer generation. |
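The setup described above keeps one frozen backbone and routes the forward pass through two named LoRA adapters (retrieval/alignment vs. answer generation). The following is a minimal conceptual sketch of that switching pattern, not the authors' code: a single frozen weight with two low-rank deltas and a `set_adapter` toggle analogous to what the PEFT library provides. The class, adapter names, and dimensions here are illustrative assumptions.

```python
# Conceptual sketch of two-adapter LoRA switching over one frozen base weight,
# mirroring the ReAuSE setup (one adapter for retrieval, one for generation).
# This is NOT the PEFT implementation; it only illustrates the mechanism.
import numpy as np

class LoRALinear:
    def __init__(self, w_base, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w_base = w_base              # frozen base weight, shape (out, in)
        self.scale = alpha / rank         # standard LoRA scaling factor
        d_out, d_in = w_base.shape
        # Two named low-rank adapters. B starts at zero, so each adapter is
        # initially a no-op, matching conventional LoRA initialization.
        self.adapters = {
            name: {"A": rng.normal(0.0, 0.02, (rank, d_in)),
                   "B": np.zeros((d_out, rank))}
            for name in ("retrieval", "generation")
        }
        self.active = "retrieval"

    def set_adapter(self, name):
        # Analogous to switching adapters in PEFT: redirect the forward pass
        # through a different low-rank delta without touching w_base.
        self.active = name

    def forward(self, x):
        a = self.adapters[self.active]
        delta = self.scale * (a["B"] @ a["A"])   # low-rank update B @ A
        return (self.w_base + delta) @ x

layer = LoRALinear(np.eye(3))
x = np.ones(3)
y_ret = layer.forward(x)                    # retrieval adapter: no-op at init
layer.adapters["generation"]["B"] += 0.1    # stand-in for a trained adapter
layer.set_adapter("generation")
y_gen = layer.forward(x)                    # same input, different adapter path
```

Because the base weight is shared and frozen, swapping adapters changes only which small `B @ A` delta is applied, which is why per-stage specialization is cheap at inference time.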