Deliberate Attention Networks for Image Captioning

Authors: Lianli Gao, Kaixuan Fan, Jingkuan Song, Xianglong Liu, Xing Xu, Heng Tao Shen

AAAI 2019, pp. 8320–8327

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model improves on the state of the art on the MSCOCO dataset, reaching 37.5% BLEU-4, 28.5% METEOR and 125.6% CIDEr. It also improves over the previous state of the art on the Flickr30K dataset, from 25.1% BLEU-4, 20.4% METEOR and 53.1% CIDEr to 29.4% BLEU-4, 23.0% METEOR and 66.6% CIDEr.
Researcher Affiliation | Academia | Lianli Gao (1), Kaixuan Fan (1), Jingkuan Song (1), Xianglong Liu (2), Xing Xu (1), Heng Tao Shen (1). (1) Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China; (2) Beihang University, China. Emails: {lianli.gao,201722060722,xing.xu}@uestc.edu.cn, jingkuan.song@gmail.com, xlliu@nlsde.buaa.edu.cn, shenhengtao@hotmail.com
Pseudocode | No | The paper describes the method using mathematical equations and textual explanations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about making the source code available, nor a direct link to a code repository for the methodology described in this paper.
Open Datasets | Yes | In this paper, we utilize two datasets including COCO (Lin et al. 2014) and Flickr30K (Young et al. 2014) to evaluate the performance of our proposed DA network.
Dataset Splits | Yes | For COCO... 113,287, 5,000 and 5,000 images are used for training, validation and testing, respectively. [...] For Flickr30K... we use 29k images for training, 1k for validation and 1k for testing. (See the split sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper mentions using pre-trained ResNet-101 and Faster R-CNN features and the Adam optimizer, but does not specify version numbers for any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In addition, the hidden state size of the two LSTMs in our DA network is set to 512. [...] The dimension of the word embedding is 512, the RNN hidden state size is set to 1024 and the dimension of the embedded image feature is 1024. [...] For MLE training, the number of epochs is set to 150 for both COCO and Flickr30K, while for reinforcement learning it is set to 200 for COCO and 150 for Flickr30K. All models are trained using Adam with a batch size of 128. We initialize the learning rate at 5e-4 and decay it by a factor of 0.8 every 15 epochs. At test time, beam search is applied to predict captions with a beam size of 5. (See the training-configuration sketch below the table.)
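
The COCO split sizes quoted above (113,287 / 5,000 / 5,000) match the widely used Karpathy splits, though the paper does not name them as such. A minimal Python sketch of the split bookkeeping follows; the dictionary layout and names are illustrative assumptions, not from the paper:

    # Split sizes as reported in the paper (likely Karpathy-style splits).
    # DATASET_SPLITS and check_split are hypothetical names for this sketch.
    DATASET_SPLITS = {
        "coco": {"train": 113_287, "val": 5_000, "test": 5_000},
        "flickr30k": {"train": 29_000, "val": 1_000, "test": 1_000},
    }

    def check_split(name: str) -> None:
        """Print the train/val/test sizes for a dataset."""
        sizes = DATASET_SPLITS[name]
        total = sum(sizes.values())
        print(f"{name}: {sizes['train']} train / {sizes['val']} val / "
              f"{sizes['test']} test ({total} images total)")

    if __name__ == "__main__":
        for dataset in DATASET_SPLITS:
            check_split(dataset)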
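
The quoted hyperparameters are enough to reconstruct the optimizer schedule. Below is a minimal PyTorch sketch, assuming a generic captioning model: the model is a placeholder (the paper's DA network is not released), and only the Adam settings, batch size, epoch counts, decay schedule, and beam size come from the paper.

    import torch
    from torch.optim import Adam
    from torch.optim.lr_scheduler import StepLR

    # Hyperparameters quoted in the paper.
    BATCH_SIZE = 128
    INIT_LR = 5e-4
    LR_DECAY = 0.8      # multiplicative decay factor
    DECAY_EVERY = 15    # epochs between decays
    MLE_EPOCHS = 150    # MLE training, COCO and Flickr30K
    BEAM_SIZE = 5       # beam search width at test time

    # Placeholder for the DA captioning network; any nn.Module with
    # trainable parameters works for demonstrating the schedule.
    model = torch.nn.LSTM(input_size=512, hidden_size=512)

    optimizer = Adam(model.parameters(), lr=INIT_LR)
    # StepLR multiplies the learning rate by `gamma` every `step_size`
    # epochs, matching the paper's "factor of 0.8 every 15 epochs".
    scheduler = StepLR(optimizer, step_size=DECAY_EVERY, gamma=LR_DECAY)

    for epoch in range(MLE_EPOCHS):
        # ... one pass over the training set in batches of BATCH_SIZE ...
        scheduler.step()  # decay the learning rate on the epoch schedule

With this schedule, the learning rate drops from 5e-4 to 4e-4 after epoch 15, 3.2e-4 after epoch 30, and so on; the reinforcement-learning stage (200 epochs on COCO) would reuse the same optimizer settings under the paper's description.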