reproducibilityindex.ai

Controllable Image Captioning via Prompting

Authors: Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li

AAAI 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks including COCO Karpathy split and TextCaps using a unified model.
Researcher Affiliation	Collaboration	Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li Huawei Inc. wn6149@mail.ustc.edu.cn, jh xie@tongji.edu.cn, {wujihao, jiamingbo, lynn.lilinlin}@huawei.com
Pseudocode	No	The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper does not provide an explicit statement or link for the release of its source code.
Open Datasets	Yes	We evaluate the proposed method on the COCO caption dataset (Lin et al. 2014) of Karpathy split (Karpathy and Fei-Fei 2015), NoCaps (Agrawal et al. 2019), and TextCaps (Sidorov et al. 2020).
Dataset Splits	Yes	We evaluate the proposed method on the COCO caption dataset (Lin et al. 2014) of Karpathy split (Karpathy and Fei-Fei 2015), NoCaps (Agrawal et al. 2019), and TextCaps (Sidorov et al. 2020).
Hardware Specification	Yes	In the pre-training stage, the model is trained on 32 V100 GPUs.
Software Dependencies	No	The paper states 'Our model is implemented in Python with PyTorch' and mentions using BERT-base and ViT-B/16, but it does not provide specific version numbers for Python, PyTorch, or other software dependencies.
Experiment Setup	Yes	We pre-train the whole model for 32 epochs using a batch size of 2880. We use AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay of 0.05. The learning rate is warmed up to 3 × 10^-4 and decayed linearly with a rate of 0.85. In the fine-tuning stage, we train the model using a small learning rate of 1 × 10^-5 and linearly decay it. The model is fine-tuned for 5 epochs. As for the prompt embedding P ∈ R^(N×768), we randomly initialize it and set N = 16.
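
The reported hyperparameters can be collected in a small configuration sketch, which may help anyone attempting a reproduction. The numeric values below come from the Experiment Setup row; everything else is an assumption made for illustration. In particular, the names `StageConfig`, `PRETRAIN`, `FINETUNE`, `PROMPT_SHAPE`, and `linear_decay_lr` are hypothetical, and the schedule decays to zero only because the paper does not state the final learning rate. The pre-training "rate of 0.85" is not modelled, since its exact meaning is not specified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    epochs: int
    peak_lr: float
    weight_decay: Optional[float] = None  # None where the row gives no value
    batch_size: Optional[int] = None      # None where the row gives no value


# Values as reported in the Experiment Setup row.
PRETRAIN = StageConfig(epochs=32, peak_lr=3e-4, weight_decay=0.05, batch_size=2880)
FINETUNE = StageConfig(epochs=5, peak_lr=1e-5)

# Prompt embedding shape: N = 16 prompt tokens, each of width 768.
PROMPT_SHAPE = (16, 768)


def linear_decay_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Linearly decay the learning rate from peak_lr toward 0.

    ASSUMPTION: decaying to exactly 0 at total_steps is illustrative;
    the paper only says the rate is linearly decayed.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so out-of-range values do not produce a negative rate.
    progress = min(max(step, 0), total_steps) / total_steps
    return peak_lr * (1.0 - progress)
```

For example, `linear_decay_lr(step, total_steps, FINETUNE.peak_lr)` gives the fine-tuning learning rate at a given step, where `total_steps` depends on the dataset size and batch size used, neither of which is reported for fine-tuning.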