What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
Authors: Libo Qin, Qiguang Chen, Hao Fei, Zhi Chen, Min Li, Wanxiang Che
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we investigate extensive experiments on the three core steps of MM-ICL including demonstration retrieval, demonstration ordering, and prompt construction using 6 vision large language models and 20 strategies. |
| Researcher Affiliation | Collaboration | School of Computer Science and Engineering, Central South University; Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology; Tsinghua University; ByteDance |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The code for exploratory prompt work generally does not need to be released, and readers can easily use the prompts we report to directly reproduce the results. |
| Open Datasets | Yes | Following the setting of Li et al. [2023c], we systematically explore 4 tasks, including image-caption, visual question answering (VQA), image classification, and chain-of-thought reasoning, which come from M3IT [Li et al., 2023c] and M3CoT [Chen et al., 2024b] (as shown in Table 2)... Table 2: Dataset in M3IT and M3CoT, where IC: Image Captioning, CLS: Classification, VQA: Visual Question Answering, R: Chain-of-Thought Reasoning (with NL rationale). Due to the cost, for each task, we evenly sampled 500 items according to the sub-dataset. (A minimal sampling sketch follows the table.) |
| Dataset Splits | No | While the authors mention using a "validation dataset" for demonstration retrieval and sampling 500 items for evaluation, the paper does not explicitly provide percentages or counts for training/validation/test splits of the overall datasets, nor does it cite predefined splits for these specific experiments. |
| Hardware Specification | Yes | In addition, all open source models complete inference on 2 A100 80G. |
| Software Dependencies | No | The paper mentions specific models and encoders (e.g., RoBERTa, CLIP-Vision Encoder, BridgeTower) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | This baseline ranks samples based on similarity, with a delimiter and a 3-shot setting (see Appendix A for details). In addition, all open source models complete inference on 2 A100 80G. For all experiments, we select top-p from {0.95, 1} and adjust the temperature parameter within [0, 1]. Among them, temperature is the main error variable in this work. (Hedged sketches of the retrieval/prompting pipeline and the decoding sweep follow the table.) |
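
The Experiment Setup row describes a similarity-ranked, delimiter-separated 3-shot prompt. Below is a minimal sketch of that pipeline, not the authors' code: `retrieve_and_order` and `build_prompt` are hypothetical names, the embeddings are placeholders for whatever retriever supplies them (the paper mentions RoBERTa, CLIP-Vision Encoder, and BridgeTower), and placing the most similar demonstration closest to the query is only one of the ordering strategies the paper studies.

```python
import numpy as np

def retrieve_and_order(query_emb, pool_embs, k=3):
    """Rank candidate demonstrations by cosine similarity to the query and
    return the top-k indices, ordered so the most similar comes last
    (i.e., closest to the query in the prompt) -- one plausible ordering."""
    sims = pool_embs @ query_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return np.argsort(sims)[-k:].tolist()  # ascending similarity, top-k kept

def build_prompt(demos, query, delimiter="\n###\n"):
    """Join k demonstrations and the unanswered query with a delimiter,
    mirroring the delimiter-separated 3-shot format described above."""
    blocks = [
        f"Image: <{d['image']}>\nQ: {d['question']}\nA: {d['answer']}"
        for d in demos
    ]
    blocks.append(f"Image: <{query['image']}>\nQ: {query['question']}\nA:")
    return delimiter.join(blocks)
```

Given a demonstration pool and its precomputed embeddings, `build_prompt([pool[i] for i in retrieve_and_order(q_emb, pool_embs)], query)` yields three delimiter-separated demonstrations followed by the unanswered query.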
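
The Open Datasets row reports evenly sampling 500 items per task according to the sub-dataset. Here is one way that could look, assuming `subdatasets` maps sub-dataset names to example lists; `even_sample`, the seed, and the remainder-spreading rule are all assumptions for illustration.

```python
import random

def even_sample(subdatasets, total=500, seed=0):
    """Draw `total` items spread as evenly as possible across sub-datasets.
    `subdatasets` maps sub-dataset name -> list of examples (placeholder)."""
    rng = random.Random(seed)
    per_split, remainder = divmod(total, len(subdatasets))
    sample = []
    for i, (name, items) in enumerate(sorted(subdatasets.items())):
        n = per_split + (1 if i < remainder else 0)  # spread the remainder
        sample.extend(rng.sample(items, min(n, len(items))))
    return sample
```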
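
The setup also selects top-p from {0.95, 1} and adjusts temperature within [0, 1]. A hedged sketch of such a sweep follows; the discrete temperature grid is an assumption (the paper only gives the range), and `run_eval` is a placeholder for a model-specific evaluation call, not the authors' API.

```python
from itertools import product

TOP_P = [0.95, 1.0]                       # values reported in the paper
TEMPERATURES = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed discretization of [0, 1]

def sweep(run_eval):
    """Evaluate every (top_p, temperature) pair; `run_eval` is hypothetical."""
    results = {}
    for top_p, temp in product(TOP_P, TEMPERATURES):
        results[(top_p, temp)] = run_eval(top_p=top_p, temperature=temp)
    return results
```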