Exploring Diverse In-Context Configurations for Image Captioning

Authors: Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, Xin Geng

NeurIPS 2023

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case. Furthermore, in our exploration of optimal combination strategies, we observed an average performance enhancement of 20.9 in CIDEr scores compared to the baseline. The code is given in https://github.com/yongliang-wu/ExploreCfg.
Researcher Affiliation | Academia | (1) School of Computer Science & Engineering, Key Lab of New Generation Artificial Intelligence Technology & Its Interdisciplinary Applications (Ministry of Education), Southeast University; (2) The Chinese University of Hong Kong, Shenzhen
Pseudocode | No | The paper describes its methods in narrative form and with illustrative figures, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is given in https://github.com/yongliang-wu/ExploreCfg.
Open Datasets | Yes | MSCOCO. We evaluate the proposed strategies on the MSCOCO dataset [14], which is the most widely used benchmark in image captioning.
Dataset Splits | Yes | We used the Karpathy split [59] in the experiments, which contains 113,287/5,000/5,000 training/validation/test images, and each image is associated with 5 human-annotated captions.
Hardware Specification | Yes | We implement all experiments on a single RTX 3090 using FP16.
Software Dependencies | No | The paper mentions using the "Open-Flamingo model [16]" and "Transformer", but provides no version numbers for software dependencies such as Python, PyTorch, or CUDA, nor for the Open-Flamingo software itself (only model versions are mentioned in the appendix).
Experiment Setup | Yes | We employ the Open-Flamingo model [16] to test our strategies, setting the length penalty to -2.0 and a maximum generation length of 20. We follow Flamingo [6] to use 4, 8, 16, and 32 shots.
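As a concrete illustration of the setup row above, the sketch below shows how a k-shot interleaved captioning prompt and the reported generation settings (length penalty -2.0, maximum generation length 20) might be assembled for an Open-Flamingo-style model. The `<image>Output:...<|endofchunk|>` template follows the common Open-Flamingo captioning convention; `build_caption_prompt`, the demo captions, and the constant names are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the in-context captioning configuration
# described in the table; not the authors' actual implementation.

def build_caption_prompt(demo_captions):
    """Interleave k demonstration captions with image placeholders,
    then append the query image slot for the model to complete."""
    parts = [f"<image>Output:{cap}<|endofchunk|>" for cap in demo_captions]
    parts.append("<image>Output:")  # the model continues from here
    return "".join(parts)

# Generation settings reported in the paper (would be passed to a
# HuggingFace-style generate() call in a real run).
GENERATION_KWARGS = {
    "max_new_tokens": 20,    # maximum generation length of 20
    "length_penalty": -2.0,  # length penalty of -2.0
}

# Example: a 2-shot prompt (the paper evaluates 4, 8, 16, and 32 shots).
prompt = build_caption_prompt([
    "A dog runs across a grassy field.",
    "Two people ride bicycles along a beach.",
])
```

In a real run, the prompt string and the demonstration images would be fed to the model together, with one image per `<image>` placeholder.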