Controllable Image Captioning via Prompting
Authors: Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks including COCO Karpathy split and TextCaps using a unified model. |
| Researcher Affiliation | Collaboration | Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li; Huawei Inc.; wn6149@mail.ustc.edu.cn, jh_xie@tongji.edu.cn, {wujihao, jiamingbo, lynn.lilinlin}@huawei.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the release of its source code. |
| Open Datasets | Yes | We evaluate the proposed method on the COCO caption dataset (Lin et al. 2014) of Karpathy split (Karpathy and Fei-Fei 2015), NoCaps (Agrawal et al. 2019), and TextCaps (Sidorov et al. 2020). |
| Dataset Splits | Yes | We evaluate the proposed method on the COCO caption dataset (Lin et al. 2014) of Karpathy split (Karpathy and Fei-Fei 2015), NoCaps (Agrawal et al. 2019), and TextCaps (Sidorov et al. 2020). |
| Hardware Specification | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs. |
| Software Dependencies | No | The paper states 'Our model is implemented in Python with PyTorch' and mentions using BERT-base and ViT-B/16, but it does not provide specific version numbers for Python, PyTorch, or other software dependencies. |
| Experiment Setup | Yes | We pre-train the whole model for 32 epochs using a batch size of 2880. We use AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay of 0.05. The learning rate is warmed-up to 3×10^-4 and decayed linearly with a rate of 0.85. In the fine-tuning stage, we train the model using a small learning rate of 1×10^-5 and linearly decay it. The model is fine-tuned for 5 epochs. As for the prompt embedding P ∈ R^{N×768}, we randomly initialize it and set N = 16. |
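
The quoted setup is detailed enough to reconstruct the optimizer and prompt-embedding configuration. Below is a minimal PyTorch sketch of those reported hyperparameters; the prompt initialization scale, warmup length, and steps-per-epoch values are assumptions (the paper does not report them), and the prompt parameters stand in for the full captioning model (BERT-base text encoder plus ViT-B/16 visual encoder), which is not released.

```python
import torch
import torch.nn as nn

# Prompt embedding as described in the paper: P ∈ R^{N×768}, randomly
# initialized, with N = 16 learnable prompt vectors.
N, D = 16, 768
prompt_embedding = nn.Parameter(torch.randn(N, D) * 0.02)  # init scale is an assumption

# Stand-in for the captioning model; only the prompt parameters are shown here.
params = nn.ParameterList([prompt_embedding])

# Reported pre-training optimizer: AdamW with weight decay 0.05 and a peak
# learning rate of 3e-4, warmed up and then decayed linearly at rate 0.85.
optimizer = torch.optim.AdamW(params.parameters(), lr=3e-4, weight_decay=0.05)

# Warmup length and steps per epoch are not reported; the values below are
# placeholders chosen only to make the schedule runnable.
warmup_steps = 1_000
steps_per_epoch = 10_000

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak rate, then decay by a factor of 0.85 per epoch."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 0.85 ** ((step - warmup_steps) // steps_per_epoch)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Fine-tuning (per the paper) reuses a smaller learning rate of 1e-5,
# linearly decayed, for 5 epochs; that stage is not sketched here.
```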