Image Captioning with Multi-Context Synthetic Data
Authors: Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps. |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology of China; ²WeChat, Tencent Inc.; ³Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. mafp@mail.ustc.edu.cn, {harryizzhou, fengyunrao}@tencent.com, {zhyuey, sunxiaoyan}@ustc.edu.cn |
| Pseudocode | No | The paper describes its method in text and via a diagram (Figure 2) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit code release statement, or code in supplementary materials) for the source code of the methodology described. |
| Open Datasets | Yes | For in-domain image captioning, we utilize MSCOCO (Lin et al. 2014) and Flickr30k (Young et al. 2014) datasets. Regarding cross-domain image captioning, we train our model on SS1M (Feng et al. 2019) dataset and evaluate its performance using MSCOCO and NoCaps (Agrawal et al. 2019). |
| Dataset Splits | Yes | Following (Karpathy and Fei-Fei 2015), we split MSCOCO into 118,287 for training, 4,000 for validation, and 1,000 for testing. We use the validation set of NoCaps to evaluate performance in three settings: in-domain, near-domain, and out-of-domain. |
| Hardware Specification | Yes | All experiments are conducted using eight NVIDIA A100 GPUs. |
| Software Dependencies | Yes | For image generation, we utilize Stable Diffusion v1.4 at a 512×512 resolution with 20 sampling steps, and we speed up the diffusion model's sampling process using DPM-Solver (Lu et al. 2022). (See the generation sketch below the table.) |
| Experiment Setup | Yes | During the generation stage, we use different group sizes for various datasets: 30 for MSCOCO, 20 for Flickr30k, and 10 for SS1M. We train the model for 30 epochs using Adam (Kingma and Ba 2015) and a batch size of 36. The learning rate is 1e-5, and a warmup strategy is applied during training. Additionally, the input synthetic images are resized to 384×384. For inference, we follow BLIP and use beam search with a beam size of 3. (See the training sketch below the table.) |
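
As a reference for the Software Dependencies row, the snippet below is a minimal sketch of the described image-generation settings (Stable Diffusion v1.4, 512×512, 20 sampling steps, DPM-Solver) using the Hugging Face diffusers library. The model ID, scheduler class, and example prompt are assumptions; the paper does not state which implementation it used.

```python
# Sketch of the image-generation settings from the Software Dependencies row.
# Model ID, scheduler class, and prompt are assumptions, not the authors' code.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
# DPM-Solver (Lu et al. 2022) speeds up sampling, so only 20 steps are needed.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")  # assumes a CUDA GPU, as in the paper's A100 setup

image = pipe(
    "a group of people riding bikes down a city street",  # hypothetical caption
    height=512,
    width=512,
    num_inference_steps=20,
).images[0]
image.save("synthetic.png")
```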
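
For the Experiment Setup row, the sketch below expresses the listed training hyperparameters (Adam, learning rate 1e-5, batch size 36, 30 epochs, 384×384 inputs, warmup) as a PyTorch loop. The `DummyCaptioner`, the random stand-in data, and the warmup length are placeholders, not the authors' BLIP-style model or schedule.

```python
# Sketch of the training configuration from the Experiment Setup row.
# DummyCaptioner and the random tensors are placeholders for the BLIP-style
# captioner and the synthetic image-caption pairs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class DummyCaptioner(nn.Module):
    """Placeholder standing in for the BLIP-style captioning model."""
    def __init__(self, vocab_size=30524, hidden=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, hidden, kernel_size=16, stride=16)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.backbone(images).mean(dim=(2, 3))   # global image feature
        logits = self.head(feats)                        # toy next-token logits
        return nn.functional.cross_entropy(logits, captions[:, 0])

# Random stand-in data: synthetic images resized to 384x384, tokenized captions.
images = torch.randn(72, 3, 384, 384)
captions = torch.randint(0, 30524, (72, 20))
loader = DataLoader(TensorDataset(images, captions), batch_size=36, shuffle=True)

model = DummyCaptioner()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# A warmup strategy is applied; the exact schedule is unspecified, so a short
# linear warmup is assumed here purely for illustration.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=100)

for epoch in range(30):
    for imgs, caps in loader:
        loss = model(imgs, caps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        warmup.step()

# Inference follows BLIP: beam search with a beam size of 3
# (e.g., generate(..., num_beams=3) in a Hugging Face-style interface).
```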