Image Captioning with Multi-Context Synthetic Data

Authors: Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps. |
| Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 WeChat, Tencent Inc.; 3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. mafp@mail.ustc.edu.cn, {harryizzhou, fengyunrao}@tencent.com, {zhyuey, sunxiaoyan}@ustc.edu.cn |
| Pseudocode | No | The paper describes its method in text and via a diagram (Figure 2) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., a specific repository link, an explicit code-release statement, or code in supplementary materials) for the source code of the described methodology. |
| Open Datasets | Yes | For in-domain image captioning, we utilize the MSCOCO (Lin et al. 2014) and Flickr30k (Young et al. 2014) datasets. Regarding cross-domain image captioning, we train our model on the SS1M (Feng et al. 2019) dataset and evaluate its performance using MSCOCO and NoCaps (Agrawal et al. 2019). |
| Dataset Splits | Yes | Following (Karpathy and Fei-Fei 2015), we split MSCOCO into 118,287 images for training, 4,000 for validation, and 1,000 for testing. We use the validation set of NoCaps to evaluate performance in three settings: in-domain, near-domain, and out-of-domain. |
| Hardware Specification | Yes | All experiments are conducted using eight NVIDIA A100 GPUs. |
| Software Dependencies | Yes | For image generation, we utilize Stable Diffusion v1.4 at a 512×512 resolution with 20 sampling steps, and we speed up the diffusion model's sampling process using DPM-Solver (Lu et al. 2022). |
| Experiment Setup | Yes | During the generation stage, we use different group sizes for various datasets: 30 for MSCOCO, 20 for Flickr30k, and 10 for SS1M. We train the model for 30 epochs using Adam (Kingma and Ba 2015) and a batch size of 36. The learning rate is 1e-5, and a warmup strategy is applied during training. Additionally, the input synthetic images are resized to 384×384. For inference, we follow BLIP and use beam search with a beam size of 3. |
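Since the paper releases no code, the generation stage it reports (Stable Diffusion v1.4 at 512×512, 20 sampling steps, DPM-Solver, per-dataset caption group sizes) could be sketched as follows. This is a minimal reconstruction, not the authors' implementation: the model identifier, the `diffusers` API usage, and the `generate_images` helper name are all assumptions.

```python
# Group sizes the report lists for each dataset (captions per group
# in the multi-context generation stage).
GROUP_SIZES = {"MSCOCO": 30, "Flickr30k": 20, "SS1M": 10}


def generate_images(captions, steps=20, size=512):
    """Render one synthetic image per caption with SD v1.4 + DPM-Solver.

    Sketch only: assumes the Hugging Face `diffusers` library and the
    CompVis/stable-diffusion-v1-4 checkpoint, which the paper does not name.
    """
    # Lazy import so GROUP_SIZES stays usable without GPU dependencies.
    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")
    # DPM-Solver is the fast sampler the paper cites; 20 steps suffice.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    out = pipe(captions, num_inference_steps=steps, height=size, width=size)
    return out.images
```

The lazy import keeps the reported group-size constants importable on machines without a diffusion stack installed.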
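The reported training setup (Adam, 30 epochs, batch size 36, learning rate 1e-5 with warmup, 384×384 inputs, beam size 3 at inference) can be collected into a small config sketch. The warmup length is a placeholder assumption, as the report does not state it.

```python
# Hyperparameters as reported; warmup_steps is an assumed placeholder.
CONFIG = {
    "epochs": 30,
    "batch_size": 36,
    "base_lr": 1e-5,
    "warmup_steps": 1000,  # not stated in the paper; illustrative only
    "image_size": 384,
    "beam_size": 3,
}


def warmup_lr(step, base_lr=CONFIG["base_lr"], warmup=CONFIG["warmup_steps"]):
    """Linearly ramp the learning rate from 0 to base_lr over `warmup` steps,
    then hold it constant (one common reading of "a warmup strategy")."""
    return base_lr * min(1.0, step / warmup)
```

A linear ramp is only one plausible warmup schedule; the paper does not specify which variant it uses.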