Exploring Diverse In-Context Configurations for Image Captioning
Authors: Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, Xin Geng
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case. Furthermore, in our exploration of optimal combination strategies, we observed an average performance enhancement of 20.9 in CIDEr scores compared to the baseline. The code is given in https://github.com/yongliang-wu/ExploreCfg. |
| Researcher Affiliation | Academia | (1) School of Computer Science & Engineering, Key Lab of New Generation Artificial Intelligence Technology & Its Interdisciplinary Applications (Ministry of Education), Southeast University; (2) The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | No | The paper describes methods in narrative and with illustrative figures, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is given in https://github.com/yongliang-wu/ExploreCfg. |
| Open Datasets | Yes | MSCOCO. We evaluate the proposed strategies on MSCOCO dataset [14], which is the most widely used benchmark in image captioning. |
| Dataset Splits | Yes | We used the Karpathy split [59] in the experiments, which contains 113,287/5000/5000 training/validation/test images and each image is associated with 5 human-annotated captions. (A split-loading sketch follows the table.) |
| Hardware Specification | Yes | We implement all experiments on a single RTX 3090 using FP16. |
| Software Dependencies | No | The paper mentions the “Open-Flamingo model [16]” and the Transformer architecture, but it does not give version numbers for software dependencies such as Python, PyTorch, or CUDA, nor for the Open-Flamingo software itself (only model versions are mentioned in the appendix). |
| Experiment Setup | Yes | We employ the Open-Flamingo model [16] to test our strategies, setting the length penalty to -2.0 and a maximum generation length of 20. We follow Flamingo [6] to use 4, 8, 16, and 32 shots. (A hedged generation-setup sketch follows the table.) |
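
The Karpathy split quoted above is distributed as a single JSON file (commonly `dataset_coco.json`). The minimal sketch below shows how the 113,287/5,000/5,000 partition can be recovered from it; the field names follow the public split file, and merging `restval` into `train` is the usual convention rather than something stated in the table, so treat both as assumptions.

```python
# Hedged sketch: partition MSCOCO with the Karpathy split.
# Assumes the standard `dataset_coco.json` file released with the split;
# field names below follow that file's layout.
import json
from collections import defaultdict

def load_karpathy_split(json_path="dataset_coco.json"):
    with open(json_path, "r") as f:
        data = json.load(f)

    splits = defaultdict(list)
    for img in data["images"]:
        # "restval" images are conventionally merged into the training set,
        # which yields the 113,287-image train split quoted above.
        split = "train" if img["split"] == "restval" else img["split"]
        captions = [s["raw"] for s in img["sentences"]]  # 5 captions per image
        splits[split].append({
            "filename": img["filename"],
            "filepath": img["filepath"],
            "captions": captions,
        })
    return splits

if __name__ == "__main__":
    splits = load_karpathy_split()
    print({k: len(v) for k, v in splits.items()})  # expect ~113287 / 5000 / 5000
```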
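
The experiment-setup row amounts to a generation configuration for OpenFlamingo. The sketch below is a hedged reconstruction, not the authors' code: the checkpoint name, prompt template, and beam size are placeholders, and only the FP16 single-GPU setting, the -2.0 length penalty, the 20-token generation limit, and the 4/8/16/32-shot range come from the paper.

```python
# Hedged sketch of few-shot in-context captioning with OpenFlamingo.
# Checkpoint, prompt format, and num_beams are assumptions.
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",  # placeholder LM
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)
# Placeholder checkpoint; the paper's exact OpenFlamingo version may differ.
ckpt = hf_hub_download("openflamingo/OpenFlamingo-3B-vitl-mpt1b", "checkpoint.pt")
model.load_state_dict(torch.load(ckpt), strict=False)
model = model.half().eval().to("cuda")  # FP16 on a single GPU

def caption(query_image, demos, num_shots=4):
    """demos: list of (PIL.Image, caption) in-context examples; num_shots in {4, 8, 16, 32}."""
    demos = demos[:num_shots]
    images = [img for img, _ in demos] + [query_image]
    # vision_x shape: (batch, num_media, num_frames, C, H, W)
    vision_x = torch.stack([image_processor(im) for im in images])
    vision_x = vision_x.unsqueeze(1).unsqueeze(0).half().to("cuda")

    # Assumed prompt template for the in-context demonstrations.
    prompt = "".join(f"<image>Output:{cap}<|endofchunk|>" for _, cap in demos)
    prompt += "<image>Output:"
    lang_x = tokenizer(prompt, return_tensors="pt").to("cuda")

    out = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=20,    # maximum generation length from the paper
        num_beams=3,          # assumed beam size
        length_penalty=-2.0,  # length penalty from the paper (forwarded to HF generate)
    )
    new_tokens = out[0][lang_x["input_ids"].shape[1]:]  # drop the prompt tokens
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```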