What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

Authors: Libo Qin, Qiguang Chen, Hao Fei, Zhi Chen, Min Li, Wanxiang Che

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To this end, we investigate extensive experiments on the three core steps of MM-ICL including demonstration retrieval, demonstration ordering, and prompt construction using 6 vision large language models and 20 strategies."
Researcher Affiliation | Collaboration | School of Computer Science and Engineering, Central South University; Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology; Tsinghua University; ByteDance
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | "The code for exploratory prompt work generally does not need to be released, and readers can easily use the prompts we report to directly reproduce the results."
Open Datasets | Yes | "Following the setting of Li et al. [2023c], we systematically explore 4 tasks, including image-caption, visual question answering (VQA), image classification, and chain-of-thought reasoning, which come from M3IT [Li et al., 2023c] and M3CoT [Chen et al., 2024b] (as shown in Table 2)... Table 2: Dataset in M3IT and M3CoT, where IC: Image Captioning, CLS: Classification, VQA: Visual Question Answering, R: Chain-of-Thought Reasoning (with NL rationale). Due to the cost, for each task, we evenly sampled 500 items according to the sub-dataset." (a sampling sketch follows the table)
Dataset Splits | No | The paper mentions a "validation dataset" for demonstration retrieval and samples 500 items per task for evaluation, but it does not give percentages or counts for training/validation/test splits of the datasets used, nor does it cite predefined splits for these experiments.
Hardware Specification | Yes | "In addition, all open source models complete inference on 2 A100 80G."
Software Dependencies | No | The paper mentions specific models and encoders (e.g., RoBERTa, CLIP-Vision Encoder, BridgeTower) but does not provide software dependencies with version numbers (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | "This baseline ranks samples based on similarity, with a delimiter and a 3-shot setting (see Appendix A for details). In addition, all open source models complete inference on 2 A100 80G. For all experiments, we select top-p from {0.95, 1} and adjust the temperature parameter within [0, 1]. Among them, temperature is the main error variable in this work." (a minimal sketch of this setup follows)
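
Reading the Research Type and Experiment Setup rows together, the paper's baseline runs the three MM-ICL steps (demonstration retrieval, demonstration ordering, prompt construction) with similarity-ranked demonstrations, a delimiter between examples, a 3-shot prompt, top-p chosen from {0.95, 1}, and temperature tuned within [0, 1]. The sketch below is one plausible reading of that setup, not the authors' released code: the `SentenceTransformer` encoder, the prompt template, and every function name here are illustrative assumptions (the paper itself compares retrievers built on RoBERTa, CLIP, and BridgeTower among its 20 strategies).

```python
# Minimal sketch of the reported MM-ICL baseline: similarity-based retrieval,
# ordering by similarity, delimiter-joined 3-shot prompt. All names below are
# placeholders, not the authors' implementation.
from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class Example:
    image_path: str
    question: str
    answer: str


def retrieve_and_order(query, pool, encoder, k=3):
    """Steps 1-2: rank the candidate pool by text similarity to the query and
    keep the top-k demonstrations, most similar last (closest to the query)."""
    texts = [ex.question for ex in pool] + [query.question]
    emb = encoder.encode(texts, normalize_embeddings=True)
    sims = emb[:-1] @ emb[-1]        # cosine similarity of each candidate to the query
    top = np.argsort(sims)[-k:]      # ascending, so the last index is the most similar
    return [pool[i] for i in top]


def build_prompt(query, demos, delimiter="\n###\n"):
    """Step 3: prompt construction, joining demonstrations with a delimiter."""
    blocks = [f"<image:{d.image_path}>\nQ: {d.question}\nA: {d.answer}" for d in demos]
    blocks.append(f"<image:{query.image_path}>\nQ: {query.question}\nA:")
    return delimiter.join(blocks)


# Decoding settings reported in the paper: top-p from {0.95, 1}, temperature in [0, 1].
GEN_CONFIG = {"top_p": 0.95, "temperature": 0.2, "max_new_tokens": 128}
```

Whether the most similar demonstration sits first or last in the prompt is itself one of the ordering strategies the paper compares; the sketch pins a single choice only to keep the example concrete.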
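
The Open Datasets row reports that, for cost reasons, 500 items per task were "evenly sampled ... according to the sub-dataset". One natural reading is a per-sub-dataset quota, sketched below; the actual sampling script is not part of the paper, so the `sub_dataset` field and the quota logic are assumptions.

```python
import random
from collections import defaultdict


def sample_evenly(items, total=500, seed=0):
    """Sample `total` items from a task, spread (approximately) evenly over its
    sub-datasets. Each item is assumed to carry a 'sub_dataset' key; this is an
    illustrative reading of the paper's sampling, not the authors' script."""
    rng = random.Random(seed)
    by_sub = defaultdict(list)
    for item in items:
        by_sub[item["sub_dataset"]].append(item)
    quota, remainder = divmod(total, len(by_sub))
    sampled = []
    for i, (_, group) in enumerate(sorted(by_sub.items())):
        n = quota + (1 if i < remainder else 0)   # distribute the remainder across sub-datasets
        sampled.extend(rng.sample(group, min(n, len(group))))
    return sampled
```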