Exploring Diverse In-Context Configurations for Image Captioning

Authors: Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, Xin Geng

NeurIPS 2023

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case. Furthermore, in our exploration of optimal combination strategies, we observed an average performance enhancement of 20.9 in CIDEr scores compared to the baseline. The code is given in https://github.com/yongliang-wu/ExploreCfg.
Researcher Affiliation | Academia | (1) School of Computer Science & Engineering, Key Lab of New Generation Artificial Intelligence Technology & Its Interdisciplinary Applications (Ministry of Education), Southeast University; (2) The Chinese University of Hong Kong, Shenzhen
Pseudocode | No | The paper describes its methods in narrative form and with illustrative figures, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is given in https://github.com/yongliang-wu/ExploreCfg.
Open Datasets | Yes | MSCOCO. We evaluate the proposed strategies on the MSCOCO dataset [14], which is the most widely used benchmark in image captioning.
Dataset Splits | Yes | We used the Karpathy split [59] in the experiments, which contains 113,287/5,000/5,000 training/validation/test images, and each image is associated with 5 human-annotated captions.
Hardware Specification | Yes | We implement all experiments on a single RTX 3090 using FP16.
Software Dependencies | No | The paper mentions using the "Open-Flamingo model [16]" and "Transformer", but provides no version numbers for software dependencies such as Python, PyTorch, or CUDA, nor for the Open-Flamingo software itself (only model versions are mentioned in the appendix).
Experiment Setup | Yes | We employ the Open-Flamingo model [16] to test our strategies, setting the length penalty to -2.0 and a maximum generation length of 20. We follow Flamingo [6] to use 4, 8, 16, and 32 shots.
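As a concrete illustration of the setup row above, the sketch below shows how a k-shot interleaved captioning prompt and the reported generation settings (length penalty -2.0, maximum generation length 20) might be assembled for an Open-Flamingo-style model. The `<image>Output:...<|endofchunk|>` template follows the common Open-Flamingo captioning convention; `build_caption_prompt`, the demo captions, and the constant names are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the in-context captioning configuration
# described in the table; not the authors' actual implementation.

def build_caption_prompt(demo_captions):
    """Interleave k demonstration captions with image placeholders,
    then append the query image slot for the model to complete."""
    parts = [f"<image>Output:{cap}<|endofchunk|>" for cap in demo_captions]
    parts.append("<image>Output:")  # the model continues from here
    return "".join(parts)

# Generation settings reported in the paper (would be passed to a
# HuggingFace-style generate() call in a real run).
GENERATION_KWARGS = {
    "max_new_tokens": 20,    # maximum generation length of 20
    "length_penalty": -2.0,  # length penalty of -2.0
}

# Example: a 2-shot prompt (the paper evaluates 4, 8, 16, and 32 shots).
prompt = build_caption_prompt([
    "A dog runs across a grassy field.",
    "Two people ride bicycles along a beach.",
])
```

In a real run, the prompt string and the demonstration images would be fed to the model together, with one image per `<image>` placeholder.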