Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Authors: Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve 61% VQA score in the OK-VQA dataset.
Researcher Affiliation | Academia | Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne, Department of Engineering, University of Cambridge, Cambridge, United Kingdom, CB2 1PZ, {wl356, jc2124, jm2245, ac2123, wjb31}@cam.ac.uk
Pseudocode | No | The paper includes diagrams (e.g., Figure 1) and mathematical formulas, but it does not present any pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our implementations are released at https://github.com/LinWeizheDragon/Retrieval-Augmented-Visual-Question-Answering.
Open Datasets | Yes | We focus on the OK-VQA dataset where a large portion of questions requires external knowledge (either commonsense or domain-specific) to answer. (1) Google Search Corpus for OK-VQA [Luo et al., 2021]: a passage corpus collected for answering OK-VQA questions. (2) Wikipedia Corpus for OK-VQA: we collect this corpus by gathering all Wikipedia passages on common objects and concepts (e.g. umbrella, dog, hat) and those containing any of the potential answers in the OK-VQA training set. We also use 10% of the WIT dataset [Srinivasan et al., 2021], a corpus based on Wikipedia with image-text pairs, to train the mapping network for multi-modal alignment.
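The data recipe quoted above (OK-VQA questions, the two passage corpora, and a 10% WIT subsample for mapping-network training) can be approximated as in the sketch below. This is an illustrative assumption, not the authors' loading code: the Hub identifier wikimedia/wit_base and the fixed seed are placeholders, and the OK-VQA passage corpora are distributed via the authors' repository rather than through a generic loader.

    from datasets import load_dataset

    # Hypothetical: deterministic 10% subsample of WIT (Wikipedia image-text pairs)
    # for training the multi-modal mapping network described above.
    wit = load_dataset("wikimedia/wit_base", split="train")
    wit_10pct = wit.shuffle(seed=42).select(range(len(wit) // 10))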
Dataset Splits | Yes | Table 6: OK-VQA dataset statistics: train questions 9,009; valid questions 5,046; images 14,055. (1) FVQA [Wang et al., 2017]: ...The average of 5 cross-validation splits is reported. (2) Infoseek [Chen et al., 2023b]: ...we split the official validation set again into validation and test sets (5,200 questions).
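For the Infoseek re-partitioning mentioned above, a minimal sketch is given below. The loader path, the seed, and using 5,200 as the held-out test size are assumptions for illustration; the paper only states that the official validation set was split again into validation and test portions.

    from datasets import load_dataset

    # Hypothetical dataset path; re-split the official Infoseek validation set
    # into new validation and test portions with a fixed seed.
    infoseek_val = load_dataset("path/to/infoseek", split="validation")
    splits = infoseek_val.train_test_split(test_size=5200, seed=42)
    new_valid, new_test = splits["train"], splits["test"]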
Hardware Specification | Yes | We use 1 Nvidia A100 (80G) for all experiments.
Software Dependencies | No | The paper mentions software components and libraries such as 'huggingface-transformers', 'ColBERTv2', 'FAISS', and 'huggingface-PEFT', and specific models such as 'BLIP-2', 'DPR', and 'T5'. However, it does not provide version numbers for these packages (e.g., PyTorch 1.9, Python 3.8), which are needed for exact reproducibility.
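Since the missing detail here is version pinning, a small standard-library sketch (not from the paper) for recording the installed versions of the named libraries is shown below; the package names are the common PyPI identifiers and may differ from the authors' environment.

    from importlib.metadata import version, PackageNotFoundError

    # Print a pin-style line for each library mentioned in the paper.
    for pkg in ["torch", "transformers", "peft", "faiss-cpu", "datasets"]:
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")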
Experiment Setup | Yes | In training the retrievers, we use learning rate 1e-4, batch size 30, and gradient accumulation steps 2 for 10k steps (for both DPR and FLMR retrievers). When training RA-VQA-v2 (T5-large), we use learning rate 6e-5, batch size 2, and gradient accumulation 16 for up to 20 epochs. We use a linearly-decaying scheduler to reduce the learning rate from 6e-5 to 0 after 20 epochs. We use LoRA [Hu et al., 2022b] to train RA-VQA-v2 (BLIP-2) with learning rate 1e-4, batch size 4, and gradient accumulation steps 16 for up to 6k steps. LoRA is configured with the default huggingface-PEFT settings: r=8, lora_alpha=32, lora_dropout=0.1.
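The LoRA settings quoted above match the default huggingface-PEFT LoraConfig values. A minimal sketch of wiring them up is shown below; the base checkpoint, the AdamW optimizer, the explicit target_modules, and applying the linear-decay schedule to the BLIP-2 run are assumptions added for illustration, not details taken from the paper.

    import torch
    from transformers import Blip2ForConditionalGeneration, get_linear_schedule_with_warmup
    from peft import LoraConfig, get_peft_model

    # Hypothetical BLIP-2 checkpoint; the paper fine-tunes BLIP-2 but the exact variant is not quoted here.
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

    # LoRA hyperparameters as reported: r=8, lora_alpha=32, lora_dropout=0.1.
    # target_modules=["q", "v"] (T5 attention projections) is an assumption for illustration,
    # since PEFT cannot always infer target modules for BLIP-2 automatically.
    lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, target_modules=["q", "v"])
    model = get_peft_model(model, lora_config)

    # Reported settings for the BLIP-2 run: learning rate 1e-4 for up to 6k steps.
    # AdamW and the linear decay (mirroring the schedule described for T5-large) are assumptions.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=6000)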