Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Authors: Abhirama Subramanyam Penamakuri, Manish Gupta, Mithun Das Gupta, Anand Mishra

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA, and also outperforms state-of-the-art methods by 4.9% and 11.8% on the image segment of the publicly available WebQA dataset on the accuracy and fluency metrics, respectively."
Researcher Affiliation | Collaboration | Abhirama Subramanyam Penamakuri (1), Manish Gupta (2), Mithun Das Gupta (2), Anand Mishra (1); (1) Indian Institute of Technology Jodhpur, (2) Microsoft
Pseudocode | No | The paper includes a system overview diagram (Figure 4) but no explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "We make our data and implementation publicly available." (Footnote: https://vl2g.github.io/projects/retvqa/)
Open Datasets | Yes | "To this end, we present a derived dataset prepared from Visual Genome [Krishna et al., 2017], leveraging its questions and annotations of images. ... We make our data and implementation publicly available."
Dataset Splits | Yes | Train: 334K questions (80%); Validation: 41K questions (10%); Test: 41K questions (10%)
Hardware Specification | Yes | "Our relevance encoder and MI-BART were trained using 3 Nvidia RTX A6000 GPUs with batch sizes of 96 and 256 during training, and batch sizes of 360 and 480 during testing, respectively."
Software Dependencies | No | "We have implemented our framework in PyTorch [Paszke et al., 2019] and Hugging Face's transformers [Wolf et al., 2020] library." While these libraries are cited with their publication years, specific version numbers (e.g., PyTorch 1.9, transformers 4.0) are not provided; a version-recording sketch appears below the table.
Experiment Setup | Yes | "We pretrain our relevance encoder on MS-COCO [Lin et al., 2014] with a constant learning rate of 1e-4 using the Adam optimizer [Kingma and Ba, 2015]. Using the same optimizer, we finetune the relevance encoder on both datasets with a constant learning rate of 2e-5. ... we further finetune MI-BART on a multi-image QA task with a learning rate of 5e-5 using the Adam optimizer with a linear warm-up of 10% of the total steps." An optimizer/warm-up sketch also appears below the table.
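
Because the Software Dependencies row notes that no library versions are pinned, anyone reproducing the work needs to record their own environment. A minimal sketch of how the installed versions could be logged; nothing here comes from the paper itself:

```python
# Record the exact library versions used for a reproduction run, since the
# paper cites PyTorch and transformers without version numbers.
import torch
import transformers

print("torch version:", torch.__version__)
print("transformers version:", transformers.__version__)
```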
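
The Experiment Setup row describes constant-learning-rate Adam for the relevance encoder and Adam with a 10% linear warm-up for MI-BART finetuning. The following is a minimal sketch of how that schedule could be wired up with Hugging Face transformers' `get_linear_schedule_with_warmup`; the model stand-in and step counts are placeholders, not values from the paper, and the post-warm-up linear decay is an assumption (the paper states only the warm-up):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Placeholder stand-in for MI-BART; only the optimizer/scheduler wiring is shown.
model = torch.nn.Linear(768, 768)

total_steps = 10_000                    # placeholder: derived from dataset size and epochs
warmup_steps = int(0.10 * total_steps)  # "linear warm-up of 10% of the total steps"

# MI-BART finetuning: Adam at 5e-5 with linear warm-up, then linear decay to zero
# (the decay is this sketch's assumption, not a claim from the paper).
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

# The relevance encoder, by contrast, uses constant learning rates per the paper
# (1e-4 for pretraining, 2e-5 for finetuning) and needs no scheduler.
for step in range(total_steps):
    # forward/backward pass omitted; only the update order is illustrated
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```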