Uncertainty-Aware Image Captioning

Authors: Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, Xiaolin Wei

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the MS COCO benchmark reveal that our approach outperforms the strong baseline and related methods on both captioning quality as well as decoding speed.
Researcher Affiliation | Industry | Meituan, Beijing, China. {feizhengcong, fanmingyuan, zhuli09, huangjunshi}@meituan.com; {weixiaoming, weixiaolin02}@meituan.com
Pseudocode | Yes | Algorithm 1: DP-based Training Data Pair Construction
Open Source Code | Yes | In particular, to improve reproducibility and foster new research in the field, we publicly release the source code and trained models of all experiments.
Open Datasets | Yes | Dataset. We evaluate our proposed method on MS COCO (Chen et al. 2015), which is a standard benchmark for image captioning tasks. To be consistent with previous works (Huang et al. 2019; Cornia et al. 2020), we adopted the Karpathy split (Karpathy and Fei-Fei 2015) that contains 113,287 training images equipped with five human-annotated sentences each and 5,000 images for validation and test splits, respectively.
Dataset Splits | Yes | To be consistent with previous works (Huang et al. 2019; Cornia et al. 2020), we adopted the Karpathy split (Karpathy and Fei-Fei 2015) that contains 113,287 training images equipped with five human-annotated sentences each and 5,000 images for validation and test splits, respectively.
Hardware Specification | Yes | The decoding time for speedup estimation is measured on a single image without minibatching and feature extraction, averaged over the whole test split with a 32G V100 GPU.
Software Dependencies | No | The paper mentions using the Adam optimizer, but it does not specify version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Specifically, the number of stacked blocks is 3, the hidden size is 512, and the feed-forward filter size is 2048 with a 0.2 dropout rate. During training, we train the UAIC model for 15 epochs with an initial learning rate of 3e-5 and decay it by 0.9 every five epochs with the combined loss presented in Equation 9 (He et al. 2019). The Adam (Kingma and Ba 2014) optimizer with a 3000-step warm-up trick is employed.
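
For the Open Datasets / Dataset Splits rows above, the following is a minimal sketch of how the Karpathy split is typically read. It assumes the widely circulated dataset_coco.json file from Karpathy and Fei-Fei (2015); the file name and field names ("images", "split", "sentences", "raw", "filename") are assumptions about that file, not artifacts released with this paper.

```python
# Sketch only: load the Karpathy COCO split and group images by split.
import json
from collections import defaultdict

with open("dataset_coco.json") as f:          # assumed split file from Karpathy & Fei-Fei (2015)
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    # 'restval' images are conventionally folded into training,
    # which is how the 113,287 / 5,000 / 5,000 counts arise.
    split = "train" if img["split"] == "restval" else img["split"]
    captions = [s["raw"] for s in img["sentences"]]  # five human-annotated captions per image
    splits[split].append((img["filename"], captions))

print({name: len(items) for name, items in splits.items()})
```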
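The Experiment Setup row pins down the optimization schedule (15 epochs, initial learning rate 3e-5, decay by 0.9 every five epochs, Adam with a 3000-step warm-up). Below is a minimal sketch of that schedule, assuming PyTorch; the model, data, and loss are toy placeholders with the stated hidden size, feed-forward size, and dropout, not the released UAIC code or its Equation 9 loss.

```python
# Sketch only: warm-up plus step-decay learning-rate schedule as reported in the paper.
import torch

base_lr = 3e-5              # initial learning rate (paper)
warmup_steps = 3000         # Adam warm-up steps (paper)
num_epochs = 15             # training epochs (paper)
decay, decay_every = 0.9, 5 # multiply lr by 0.9 every five epochs (paper)

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.ReLU(),
    torch.nn.Dropout(0.2), torch.nn.Linear(2048, 512),
)  # placeholder with the paper's hidden size (512), filter size (2048), dropout (0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

step = 0
for epoch in range(num_epochs):
    for _ in range(100):                        # placeholder for the real data loader
        step += 1
        # linear warm-up over the first 3000 steps, then 0.9 decay every 5 epochs
        lr = base_lr * min(1.0, step / warmup_steps) * decay ** (epoch // decay_every)
        for group in optimizer.param_groups:
            group["lr"] = lr

        x = torch.randn(8, 512)                 # dummy batch; the real objective is Eq. 9
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```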