MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below lists the reproducibility variable, the extracted result, and the supporting LLM response.

Research Type: Experimental
LLM Response: Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks such as MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates the language bias that often leads VLMs to hallucinate when faced with extensive textual context.

Researcher Affiliation: Academia
LLM Response: Haozhe Zhao (1,2), Zefan Cai (1), Shuzheng Si (1), Xiaojian Ma (3), Kaikai An (1), Liang Chen (1), Zixuan Liu (4), Sheng Wang (4), Wenjuan Han (5), Baobao Chang (1). Affiliations: 1 National Key Laboratory for Multimedia Information Processing, Peking University; 2 School of Software and Microelectronics, Peking University, China; 3 National Key Laboratory of General Artificial Intelligence, BIGAI; 4 Paul G. Allen School of Computer Science and Engineering, University of Washington; 5 Beijing Jiaotong University.

Pseudocode: Yes
LLM Response: Algorithm 1 (Image Declaration).

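The Algorithm 1 reference above is the paper's image declaration, which names each image with a textual proxy token so that later text can refer to it explicitly. Below is a minimal Python sketch of the idea; the `image_declaration` helper, the `[IMGj]` token format, and the `<image_embed>` placeholder are illustrative assumptions, not the paper's exact implementation.

```python
def image_declaration(image_ids, placeholder="<image_embed>"):
    """Build an image-declaration prefix that assigns each image a textual
    proxy token (e.g. "[IMG0]") so later text can reference it explicitly.
    Sketch only: token names and placeholder are assumed, not the paper's."""
    parts = []
    for j in image_ids:
        proxy = f"[IMG{j}]"
        parts.append(f"image {j} is {proxy} {placeholder}.")
    return " ".join(parts)

# Usage: prefix a multi-image question with explicit image references.
prompt = image_declaration([0, 1]) + " What is the person in image 0 doing in image 1?"
print(prompt)
```
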
Open Source Code: Yes
LLM Response: Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC.

Open Datasets: Yes
LLM Response: Our dataset is automatically constructed from existing datasets. First, we created an image declaration for each instance in all datasets to produce datasets with explicit text-to-image references. Second, we created an instruction template for each dataset and asked ChatGPT to rewrite the instructions, filling in data from the existing datasets to obtain a dataset with diverse instruction formats. Finally, we used these datasets with instructions to construct the MIC dataset according to our proposed context scheme. For the example presented in Fig. 3 and Fig. 4 (e.g., two people quarreling with each other), we constructed the data from the existing annotations (i.e., bounding boxes and the relations between them) provided by the VCR dataset (Zellers et al., 2019). We also constructed an in-context learning dataset by sampling examples from the original dataset, and we extracted eight frames per video from video datasets to generate multi-modal data with interconnected images. Details are presented in Appendix D.

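Among the construction steps quoted above, the most mechanical one is sampling eight frames per video to build interleaved image-text data. The sketch below shows uniform frame sampling with OpenCV; the `sample_frames` helper and the choice of OpenCV are assumptions about the preprocessing, not the authors' released dataset tool.

```python
import cv2

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample num_frames frames from a video and return them as RGB
    arrays. Hypothetical helper mirroring the 'eight frames per video' step."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    # Evenly spaced frame indices across the whole clip.
    indices = [round(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the target frame
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```
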
Dataset Splits: Yes
LLM Response: Table 9: Detailed task descriptions and statistics of our instruction tuning tasks, covering all datasets across all task types. The column "Used" indicates whether the dataset is used in the multi-modal in-context tuning stage. Example row: MS COCO (Lin et al., 2014) | Yes | 413,952 | 202,496 | 0 | Custom.

Hardware Specification: Yes
LLM Response: All experiments are conducted on 6 NVIDIA A40 GPUs using the ZeRO-2 offload strategy (Rajbhandari et al., 2020) of DeepSpeed (Rasley et al., 2020) with the Hugging Face Transformers trainer (Wolf et al., 2020).

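The row above names DeepSpeed ZeRO-2 with offloading under the Hugging Face trainer, but the paper does not publish its configuration, so the block below is only a minimal sketch of a ZeRO-2 CPU-offload config; every value, and the ds_zero2_offload.json file name, is an assumption.

```python
import json

# Minimal DeepSpeed ZeRO-2 config with CPU optimizer offload (values are assumed,
# not taken from the paper); "auto" fields are resolved by the Hugging Face trainer.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": "auto"},
}

with open("ds_zero2_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```
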
Software Dependencies: No
LLM Response: The paper mentions using DeepSpeed (Rasley et al., 2020) and Hugging Face Transformers (Wolf et al., 2020) but does not provide specific version numbers for these software dependencies.

Experiment Setup: Yes
LLM Response: All experiments are conducted on 6 NVIDIA A40 GPUs using the ZeRO-2 offload strategy (Rajbhandari et al., 2020) of DeepSpeed (Rasley et al., 2020) with the Hugging Face Transformers trainer (Wolf et al., 2020). The batch size is 10 for MMICL (FLAN-T5-XL) and 4 for MMICL (FLAN-T5-XXL). The largest model, MMICL (FLAN-T5-XXL), requires about two days for Stage II. In Stage II, we train the model for three epochs with a lower learning rate of 1e-5.

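To show how the quoted hyperparameters would map onto the Hugging Face trainer, here is a hedged sketch of the Stage II training arguments; the batch size, epoch count, and learning rate follow the numbers above, while the output directory, precision setting, and DeepSpeed config path (reusing the sketch after the Hardware Specification entry) are assumptions.

```python
from transformers import TrainingArguments

# Sketch of Stage II training arguments matching the quoted setup
# (batch size 10 for FLAN-T5-XL, 3 epochs, lr 1e-5); paths are assumed.
training_args = TrainingArguments(
    output_dir="mmicl_stage2",            # assumed output path
    per_device_train_batch_size=10,       # 10 for FLAN-T5-XL, 4 for FLAN-T5-XXL
    num_train_epochs=3,
    learning_rate=1e-5,
    deepspeed="ds_zero2_offload.json",    # ZeRO-2 offload config (see sketch above)
    bf16=True,                            # assumed mixed-precision setting
)
```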