Controllable Image Captioning via Prompting

Authors: Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks including COCO Karpathy split and TextCaps using a unified model.
Researcher Affiliation | Collaboration | Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li; Huawei Inc.; wn6149@mail.ustc.edu.cn, jh_xie@tongji.edu.cn, {wujihao, jiamingbo, lynn.lilinlin}@huawei.com
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link for the release of its source code.
Open Datasets | Yes | We evaluate the proposed method on the COCO caption dataset (Lin et al. 2014) of Karpathy split (Karpathy and Fei-Fei 2015), NoCaps (Agrawal et al. 2019), and TextCaps (Sidorov et al. 2020).
Dataset Splits | Yes | We evaluate the proposed method on the COCO caption dataset (Lin et al. 2014) of Karpathy split (Karpathy and Fei-Fei 2015), NoCaps (Agrawal et al. 2019), and TextCaps (Sidorov et al. 2020).
Hardware Specification | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs.
Software Dependencies | No | The paper states 'Our model is implemented in Python with PyTorch' and mentions using BERT-base and ViT-B/16, but it does not provide specific version numbers for Python, PyTorch, or other software dependencies.
Experiment Setup | Yes | We pre-train the whole model for 32 epochs using a batch size of 2880. We use AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay of 0.05. The learning rate is warmed-up to 3 × 10^-4 and decayed linearly with a rate of 0.85. In the fine-tuning stage, we train the model using a small learning rate of 1 × 10^-5 and linearly decay it. The model is fine-tuned for 5 epochs. As for the prompt embedding P ∈ R^{N×768}, we randomly initialize it and set N = 16.
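
The fine-tuning settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is a minimal illustration, not the authors' code (the paper releases none): the names `FineTuneConfig`, `linear_decay_lr`, and `init_prompt_embedding` are hypothetical, the decay is assumed to end at zero, and the Gaussian initialization with standard deviation 0.02 is an assumption, since the paper only says the prompt embedding is randomly initialized.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning values quoted in the Experiment Setup row."""
    peak_lr: float = 1e-5    # "small learning rate of 1 × 10^-5"
    epochs: int = 5          # "fine-tuned for 5 epochs"
    prompt_length: int = 16  # N = 16
    hidden_size: int = 768   # prompt embedding width


def linear_decay_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Learning rate at `step`, falling linearly from peak_lr.

    Assumption: the schedule reaches 0 at total_steps; the paper
    does not state the final value.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    return peak_lr * (1.0 - step / total_steps)


def init_prompt_embedding(n: int, dim: int, std: float = 0.02, seed: int = 0):
    """Randomly initialized n x dim prompt embedding as nested lists.

    Assumption: zero-mean Gaussian with std 0.02; the source only
    says the embedding is randomly initialized.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n)]


cfg = FineTuneConfig()
prompt = init_prompt_embedding(cfg.prompt_length, cfg.hidden_size)
halfway_lr = linear_decay_lr(50, 100, cfg.peak_lr)  # rate midway through training
```

In a real PyTorch setup the prompt would presumably be a learnable parameter and the schedule would be attached to the AdamW optimizer; plain Python is used here only to keep the sketch dependency-free.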