Image Captioning with Multi-Context Synthetic Data

Authors: Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps. |
| Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 WeChat, Tencent Inc.; 3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. mafp@mail.ustc.edu.cn, {harryizzhou, fengyunrao}@tencent.com, {zhyuey, sunxiaoyan}@ustc.edu.cn |
| Pseudocode | No | The paper describes its method in text and via a diagram (Figure 2) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., a specific repository link, an explicit code-release statement, or code in supplementary materials) for the source code of the described methodology. |
| Open Datasets | Yes | For in-domain image captioning, we utilize the MSCOCO (Lin et al. 2014) and Flickr30k (Young et al. 2014) datasets. Regarding cross-domain image captioning, we train our model on the SS1M (Feng et al. 2019) dataset and evaluate its performance using MSCOCO and NoCaps (Agrawal et al. 2019). |
| Dataset Splits | Yes | Following (Karpathy and Fei-Fei 2015), we split MSCOCO into 118,287 images for training, 4,000 for validation, and 1,000 for testing. We use the validation set of NoCaps to evaluate performance in three settings: in-domain, near-domain, and out-of-domain. |
| Hardware Specification | Yes | All experiments are conducted using eight NVIDIA A100 GPUs. |
| Software Dependencies | Yes | For image generation, we utilize Stable Diffusion v1.4 at a 512×512 resolution with 20 sampling steps, and we speed up the diffusion model's sampling process using DPM-Solver (Lu et al. 2022). |
| Experiment Setup | Yes | During the generation stage, we use different group sizes for various datasets: 30 for MSCOCO, 20 for Flickr30k, and 10 for SS1M. We train the model for 30 epochs using Adam (Kingma and Ba 2015) and a batch size of 36. The learning rate is 1e-5, and a warmup strategy is applied during training. Additionally, the input synthetic images are resized to 384×384. For inference, we follow BLIP and use beam search with a beam size of 3. |
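Since the paper releases no code, the generation stage it reports (Stable Diffusion v1.4 at 512×512, 20 sampling steps, DPM-Solver, per-dataset caption group sizes) could be sketched as follows. This is a minimal reconstruction, not the authors' implementation: the model identifier, the `diffusers` API usage, and the `generate_images` helper name are all assumptions.

```python
# Group sizes the report lists for each dataset (captions per group
# in the multi-context generation stage).
GROUP_SIZES = {"MSCOCO": 30, "Flickr30k": 20, "SS1M": 10}


def generate_images(captions, steps=20, size=512):
    """Render one synthetic image per caption with SD v1.4 + DPM-Solver.

    Sketch only: assumes the Hugging Face `diffusers` library and the
    CompVis/stable-diffusion-v1-4 checkpoint, which the paper does not name.
    """
    # Lazy import so GROUP_SIZES stays usable without GPU dependencies.
    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")
    # DPM-Solver is the fast sampler the paper cites; 20 steps suffice.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    out = pipe(captions, num_inference_steps=steps, height=size, width=size)
    return out.images
```

The lazy import keeps the reported group-size constants importable on machines without a diffusion stack installed.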
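The reported training setup (Adam, 30 epochs, batch size 36, learning rate 1e-5 with warmup, 384×384 inputs, beam size 3 at inference) can be collected into a small config sketch. The warmup length is a placeholder assumption, as the report does not state it.

```python
# Hyperparameters as reported; warmup_steps is an assumed placeholder.
CONFIG = {
    "epochs": 30,
    "batch_size": 36,
    "base_lr": 1e-5,
    "warmup_steps": 1000,  # not stated in the paper; illustrative only
    "image_size": 384,
    "beam_size": 3,
}


def warmup_lr(step, base_lr=CONFIG["base_lr"], warmup=CONFIG["warmup_steps"]):
    """Linearly ramp the learning rate from 0 to base_lr over `warmup` steps,
    then hold it constant (one common reading of "a warmup strategy")."""
    return base_lr * min(1.0, step / warmup)
```

A linear ramp is only one plausible warmup schedule; the paper does not specify which variant it uses.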