MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. |
| Researcher Affiliation | Academia | Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny (King Abdullah University of Science and Technology), {deyao.zhu, jun.chen, xiaoqian.shen, xiang.li.1, mohamed.elhoseiny}@kaust.edu.sa |
| Pseudocode | No | The paper describes the methods verbally and through architectural diagrams (Figure 1), but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/. |
| Open Datasets | Yes | We utilize datasets from Conceptual Caption (Changpinyo et al., 2021; Sharma et al., 2018), SBU (Ordonez et al., 2011), and LAION (Schuhmann et al., 2021) for this process. |
| Dataset Splits | No | The paper states that various datasets are used for training (LAION, Conceptual Captions, SBU) and finetuning (a curated set of 3,500 pairs), but it does not provide explicit training/validation/test splits for these datasets, so the data partitioning used in training cannot be reproduced exactly. A COCO validation set is mentioned for evaluation, but the splitting of the primary training data is not detailed. |
| Hardware Specification | Yes | MiniGPT-4 is initially trained for 20k steps using a batch size of 256 on 4 A100 GPUs, leveraging a combined image captioning dataset... and completes in around 10 hours on 4 A100 (80GB) GPUs. |
| Software Dependencies | No | The paper mentions several models and frameworks such as Vicuna, BLIP-2, LLaMA, ViT, Q-Former, Flan-T5, and EVA-CLIP, and the use of GPT-4 turbo for evaluation. However, it does not specify version numbers for general software dependencies like Python, PyTorch, or other libraries used for implementation. |
| Experiment Setup | Yes | The model undergoes 20,000 training steps with a batch size of 256, covering about 5 million image-text pairs, and completes in around 10 hours on 4 A100 (80GB) GPUs. (See the configuration sketch after this table.) |
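
To make the reported setup concrete, below is a minimal configuration sketch in Python that collects the figures quoted in the table. The class and field names (`Stage1Config`, `Stage2Config`, `train_steps`, etc.) are illustrative and not taken from the MiniGPT-4 codebase; only the numeric values and dataset names come from the paper's text as quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stage1Config:
    """First-stage pretraining setup as reported in the table above."""
    train_steps: int = 20_000   # "20,000 training steps"
    batch_size: int = 256       # covers ~5 million image-text pairs in total
    num_gpus: int = 4           # A100 (80GB) GPUs; ~10 hours wall clock
    # Pretraining corpora named in the paper; mixing ratios are not
    # reported, so any sampling scheme is left unspecified here.
    datasets: List[str] = field(default_factory=lambda: [
        "LAION", "Conceptual Captions", "SBU",
    ])


@dataclass
class Stage2Config:
    """Second-stage finetuning on the curated detailed-description set."""
    num_image_text_pairs: int = 3_500
    # Steps, batch size, and hardware for this stage are not quoted in
    # the table above, so they are left unset rather than guessed.
    train_steps: Optional[int] = None
    batch_size: Optional[int] = None


if __name__ == "__main__":
    # Print the two stages' reported settings for quick inspection.
    print(Stage1Config())
    print(Stage2Config())
```

Keeping the two stages as separate configuration objects mirrors the paper's two-stage pipeline: a large-scale pretraining run on short caption pairs, followed by a small finetuning pass on the curated detailed-description dataset that improves generation reliability.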