Generative Multi-Modal Knowledge Retrieval with Large Language Models

Authors: Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through experiments conducted on three benchmarks, we demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines."
Researcher Affiliation | Collaboration | "1 Department of Electronic Engineering, Tsinghua University, Beijing, China; 2 Pattern Recognition Center, WeChat AI, Tencent Inc., China"
Pseudocode | No | The paper describes the model's architecture and processes but does not include a dedicated pseudocode or algorithm block.
Open Source Code | No | "The code will be released in this repository: https://github.com/xinwei666/MMGenerativeIR"
Open Datasets | Yes | "We conduct experiments on three benchmarks of multi-modal knowledge retrieval: OKVQA-GS112K (Luo et al. 2021a), OKVQA-WK21M (Luo et al. 2023b) and ReMuQ (Luo et al. 2023b)"
Dataset Splits | Yes | Train/Val/Test splits: OKVQA-GS112K 8,062/896/5,046; OKVQA-WK21M 8,062/896/5,046; ReMuQ 7,576/842/3,609
Hardware Specification | Yes | "Training is performed on an NVIDIA A6000 48G GPU and completed within three hours."
Software Dependencies | No | "Our model is implemented by Pytorch and trained using a learning rate of 6e-5, the Adam optimizer with a warmup strategy, and batches of 12 instruction data... We use YOLOv7 (Wang, Bochkovskiy, and Liao 2022) to obtain bounding boxes..." The paper names PyTorch and YOLOv7 but does not provide specific version numbers for either.
Experiment Setup | Yes | "Our model is implemented by Pytorch and trained using a learning rate of 6e-5, the Adam optimizer with a warmup strategy, and batches of 12 instruction data."
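
For readers who want to mirror the reported setup, the sketch below wires the quoted hyperparameters (Adam, learning rate 6e-5, a warmup schedule, batch size 12) into a minimal PyTorch training loop. The model, dataset, loss, and warmup length are placeholders chosen for illustration, not the authors' released code, which was not available at the time of this report.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model and data; the real system is the paper's generative
# multi-modal retriever trained on instruction data.
model = torch.nn.Linear(768, 768)
dataset = torch.utils.data.TensorDataset(torch.randn(120, 768))
loader = torch.utils.data.DataLoader(dataset, batch_size=12, shuffle=True)  # batches of 12 (as reported)

optimizer = Adam(model.parameters(), lr=6e-5)  # Adam with learning rate 6e-5 (as reported)

# The paper only says "warmup strategy"; a linear warmup to the base rate
# over an assumed 100 steps is used here as one common choice.
warmup_steps = 100

def lr_lambda(step: int) -> float:
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

for (batch,) in loader:
    loss = model(batch).pow(2).mean()  # dummy objective; the paper trains a generation loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

On a single A6000-class GPU, a configuration along these lines is consistent with the paper's claim of training completing within a few hours, though the actual wall-clock time depends on the full model and dataset.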