MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. |
| Researcher Affiliation | Academia | Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny (King Abdullah University of Science and Technology), {deyao.zhu, jun.chen, xiaoqian.shen, xiang.li.1, mohamed.elhoseiny}@kaust.edu.sa |
| Pseudocode | No | The paper describes the methods verbally and through architectural diagrams (Figure 1), but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/. |
| Open Datasets | Yes | We utilize datasets from Conceptual Caption (Changpinyo et al., 2021; Sharma et al., 2018), SBU (Ordonez et al., 2011), and LAION (Schuhmann et al., 2021) for this process. |
| Dataset Splits | No | The paper states that various datasets are used for training (LAION, Conceptual Captions, SBU) and finetuning (a curated set of 3,500 pairs), but it does not provide explicit training/validation/test splits for these datasets, so the data partitioning used in training cannot be reproduced exactly. A COCO validation set is mentioned for evaluation, but the splitting of the primary training data is not detailed. |
| Hardware Specification | Yes | MiniGPT-4 is initially trained for 20k steps using a batch size of 256 on 4 A100 GPUs, leveraging a combined image captioning dataset... and completes in around 10 hours on 4 A100 (80GB) GPUs. |
| Software Dependencies | No | The paper mentions several models and frameworks such as Vicuna, BLIP-2, LLaMA, ViT, Q-Former, Flan-T5, and EVA-CLIP, and the use of GPT-4 turbo for evaluation. However, it does not specify version numbers for general software dependencies like Python, PyTorch, or other libraries used for implementation. |
| Experiment Setup | Yes | The model undergoes 20,000 training steps with a batch size of 256, covering about 5 million image-text pairs, and completes in around 10 hours on 4 A100 (80GB) GPUs. (See the configuration sketch after this table.) |
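
To make the reported setup concrete, below is a minimal configuration sketch in Python that collects the figures quoted in the table. The class and field names (`Stage1Config`, `Stage2Config`, `train_steps`, etc.) are illustrative and not taken from the MiniGPT-4 codebase; only the numeric values and dataset names come from the paper's text as quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stage1Config:
    """First-stage pretraining setup as reported in the table above."""
    train_steps: int = 20_000   # "20,000 training steps"
    batch_size: int = 256       # covers ~5 million image-text pairs in total
    num_gpus: int = 4           # A100 (80GB) GPUs; ~10 hours wall clock
    # Pretraining corpora named in the paper; mixing ratios are not
    # reported, so any sampling scheme is left unspecified here.
    datasets: List[str] = field(default_factory=lambda: [
        "LAION", "Conceptual Captions", "SBU",
    ])


@dataclass
class Stage2Config:
    """Second-stage finetuning on the curated detailed-description set."""
    num_image_text_pairs: int = 3_500
    # Steps, batch size, and hardware for this stage are not quoted in
    # the table above, so they are left unset rather than guessed.
    train_steps: Optional[int] = None
    batch_size: Optional[int] = None


if __name__ == "__main__":
    # Print the two stages' reported settings for quick inspection.
    print(Stage1Config())
    print(Stage2Config())
```

Keeping the two stages as separate configuration objects mirrors the paper's two-stage pipeline: a large-scale pretraining run on short caption pairs, followed by a small finetuning pass on the curated detailed-description dataset that improves generation reliability.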