Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
Authors: Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Das Gupta, Anand Mishra
IJCAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA and also outperforms state-of-the-art methods by 4.9% and 11.8% on the image segment of the publicly available Web QA dataset on the accuracy and fluency metrics, respectively. |
| Researcher Affiliation | Collaboration | Abhirama Subramanyam Penamakuri¹, Manish Gupta², Mithun Das Gupta², Anand Mishra¹; ¹Indian Institute of Technology Jodhpur, ²Microsoft |
| Pseudocode | No | The paper includes a system overview diagram (Figure 4) but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our data and implementation publicly available.¹ (Footnote 1: https://vl2g.github.io/projects/retvqa/) |
| Open Datasets | Yes | To this end, we present a derived dataset prepared from Visual Genome [Krishna et al., 2017], leveraging its questions and annotations of images. ... We make our data and implementation publicly available.¹ |
| Dataset Splits | Yes | Train set questions: 334K (80%); Val set questions: 41K (10%); Test set questions: 41K (10%) |
| Hardware Specification | Yes | Our relevance encoder and MI-BART were trained using 3 Nvidia RTX A6000 GPUs with a batch size of 96 and 256 while training and a batch size of 360 and 480 during testing, respectively. |
| Software Dependencies | No | We have implemented our framework in PyTorch [Paszke et al., 2019] and Hugging Face's transformers [Wolf et al., 2020] library. While these libraries are mentioned with their publication years, specific version numbers (e.g., PyTorch 1.9, transformers 4.0) are not provided. |
| Experiment Setup | Yes | We pretrain our relevance encoder on MS-COCO [Lin et al., 2014] with a constant learning rate of 1e-4 using Adam optimizer [Kingma and Ba, 2015]. Using the same optimiser, we finetune the relevance encoder on both datasets with a constant learning rate of 2e-5. ...we further finetune MI-BART on a multi-image QA task with a learning rate of 5e-5 using Adam optimizer with a linear warm-up of 10% of the total steps. |
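
The Experiment Setup row reports a learning rate of 5e-5 with a linear warm-up over 10% of the total steps for MI-BART fine-tuning. The sketch below shows one way such a schedule could be computed. It is an illustration only, not the authors' code: the function name `warmup_lr` and its parameters are hypothetical, and holding the rate constant after warm-up is an assumption, since the quoted text does not say what follows the warm-up phase.

```python
def warmup_lr(step: int, total_steps: int,
              peak_lr: float = 5e-5, warmup_frac: float = 0.10) -> float:
    """Return the learning rate for a 0-indexed optimizer step.

    The rate rises linearly from peak_lr / warmup_steps to peak_lr over the
    first warmup_frac of training.
    Assumption: after warm-up the rate stays at peak_lr; the source does not
    state whether it is held constant or decayed.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Number of warm-up steps: 10% of the total by default, at least one.
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for illustration
    for s in (0, 49, 99, 500):
        print(s, warmup_lr(s, total))
```

With 1000 total steps, the first 100 steps form the warm-up phase. A function of this shape could be wrapped in a per-step scheduler in a training loop; whether the authors used a constant or a decaying rate after warm-up would need to be checked against their released implementation.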