Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
Authors: Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Das Gupta, Anand Mishra
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA, and also outperforms state-of-the-art methods by 4.9% and 11.8% on the image segment of the publicly available WebQA dataset on the accuracy and fluency metrics, respectively. |
| Researcher Affiliation | Collaboration | Abhirama Subramanyam Penamakuri¹, Manish Gupta², Mithun Das Gupta², Anand Mishra¹ (¹Indian Institute of Technology Jodhpur, ²Microsoft) |
| Pseudocode | No | The paper includes a system overview diagram (Figure 4) but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our data and implementation publicly available: https://vl2g.github.io/projects/retvqa/ |
| Open Datasets | Yes | To this end, we present a derived dataset prepared from Visual Genome [Krishna et al., 2017], leveraging its questions and annotations of images. ... We make our data and implementation publicly available. |
| Dataset Splits | Yes | Train: 334K questions (80%); Val: 41K questions (10%); Test: 41K questions (10%) |
| Hardware Specification | Yes | Our relevance encoder and MI-BART were trained using 3 Nvidia RTX A6000 GPUs with a batch size of 96 and 256 while training and a batch size of 360 and 480 during testing, respectively. |
| Software Dependencies | No | We have implemented our framework in PyTorch [Paszke et al., 2019] and Hugging Face's transformers [Wolf et al., 2020] library. While these libraries are mentioned with their publication years, specific version numbers (e.g., PyTorch 1.9, transformers 4.0) are not provided. |
| Experiment Setup | Yes | We pretrain our relevance encoder on MS-COCO [Lin et al., 2014] with a constant learning rate of 1e-4 using Adam optimizer [Kingma and Ba, 2015]. Using the same optimizer, we finetune the relevance encoder on both datasets with a constant learning rate of 2e-5. ...we further finetune MI-BART on a multi-image QA task with a learning rate of 5e-5 using Adam optimizer with a linear warm-up of 10% of the total steps. (A hedged sketch of this optimizer and warm-up schedule appears after the table.) |
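
The paper reports the optimizer and schedule for MI-BART finetuning but no code. The following is a minimal PyTorch/Hugging Face sketch of that setup, not the authors' implementation: `facebook/bart-base` stands in for the MI-BART checkpoint, and the total step count is a placeholder, since neither is specified in the paper.

```python
import torch
from transformers import (
    BartTokenizer,
    BartForConditionalGeneration,
    get_linear_schedule_with_warmup,
)

# "facebook/bart-base" is a stand-in; the paper's MI-BART checkpoint is not released as a HF model id.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Reported schedule: Adam, lr 5e-5, linear warm-up over 10% of total steps.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
total_steps = 1000  # placeholder; the paper does not report the total number of steps
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

# One illustrative update on a toy question/answer pair.
inputs = tokenizer("What colour is the bus?", return_tensors="pt")
labels = tokenizer("The bus is red.", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

Per the quoted setup, the relevance-encoder stages would follow the same pattern with constant learning rates of 1e-4 (pretraining on MS-COCO) and 2e-5 (finetuning), i.e. with no warm-up scheduler.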