Generative Multi-Modal Knowledge Retrieval with Large Language Models
Authors: Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments conducted on three benchmarks, we demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines. |
| Researcher Affiliation | Collaboration | 1Department of Electronic Engineering, Tsinghua University, Beijing, China; 2Pattern Recognition Center, WeChat AI, Tencent Inc., China |
| Pseudocode | No | The paper describes the model's architecture and processes but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The code will be released in this repository: https://github.com/xinwei666/MMGenerativeIR (promised, but not yet publicly available at the time of this review). |
| Open Datasets | Yes | We conduct experiments on three benchmarks of multi-modal knowledge retrieval: OKVQA-GS112K (Luo et al. 2021a), OKVQA-WK21M (Luo et al. 2023b) and ReMuQ (Luo et al. 2023b) |
| Dataset Splits | Yes | Train/Val/Test: OKVQA-GS112K 8,062/896/5,046; OKVQA-WK21M 8,062/896/5,046; ReMuQ 7,576/842/3,609 |
| Hardware Specification | Yes | Training is performed on an NVIDIA A6000 48G GPU and completed within three hours. |
| Software Dependencies | No | Our model is implemented in PyTorch and trained using a learning rate of 6e-5, the Adam optimizer with a warmup strategy, and batches of 12 instruction data... We use YOLOv7 (Wang, Bochkovskiy, and Liao 2022) to obtain bounding boxes... The paper names PyTorch and YOLOv7 but does not give a version number for either. |
| Experiment Setup | Yes | Our model is implemented in PyTorch and trained using a learning rate of 6e-5, the Adam optimizer with a warmup strategy, and a batch size of 12 instruction examples. (A hedged configuration sketch follows the table.) |
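
For reproduction purposes, the reported setup maps onto a few lines of PyTorch. The following is a minimal sketch, not the authors' code: the warmup length, total step count, schedule shape, and the placeholder `model` are all assumptions, since the paper reports only the learning rate (6e-5), the Adam optimizer with a warmup strategy, and a batch size of 12.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters reported in the paper: lr 6e-5, Adam with warmup, batch size 12.
LR = 6e-5
BATCH_SIZE = 12
WARMUP_STEPS = 500   # assumption: the warmup length is not reported
TOTAL_STEPS = 5_000  # assumption: the total step count is not reported

# Placeholder module standing in for the paper's generative retriever,
# which was not publicly released at the time of this review.
model = torch.nn.Linear(768, 768)

optimizer = Adam(model.parameters(), lr=LR)

# Linear warmup followed by linear decay, one common reading of
# "warmup strategy"; the paper does not specify the schedule shape.
def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
```

Calling `scheduler.step()` once per optimizer update yields the warmup-then-decay behaviour; under the paper's reported setup, training completes within three hours on a single NVIDIA A6000 48G GPU.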