Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning

Authors: Hui Chen, Guiguang Ding, Zijia Lin, Sicheng Zhao, Jungong Han

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify the effectiveness and superiority of the proposed approach over other captioning approaches by conducting massive experiments and comparisons on the MS COCO image captioning dataset.
Researcher Affiliation | Collaboration | Beijing National Research Center for Information Science and Technology (BNRist), School of Software, Tsinghua University, Beijing, China; School of Computing & Communications, Lancaster University, UK; Microsoft Research, Beijing, China. Emails: {jichenhui2012,jungonghan77,schzhao}@gmail.com, dinggg@tsinghua.edu.cn, zijlin@microsoft.com
Pseudocode | No | The paper describes the model architecture and mathematical formulations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions 'https://github.com/tylin/coco-caption' for the MS COCO caption evaluation tool, which is a third-party tool, but does not provide a link or statement about the availability of their own source code. (A usage sketch for the evaluation tool follows the table.)
Open Datasets | Yes | Following previous works [Yao et al., 2017; Yang et al., 2016], we conduct experiments on the popular MS COCO dataset [Lin et al., 2014], which consists of 82783 training images and 40504 validation images.
Dataset Splits | Yes | For offline evaluation, we follow most previous works [Chen et al., 2017a; Yao et al., 2017] and split the 123287 images into three parts: 5000 for validation, 5000 for test and the remainder for training.
Hardware Specification | No | The paper discusses the models and optimizers used (ResNet-101, ADAM optimizer) but does not provide specific hardware details like GPU or CPU models used for training or inference.
Software Dependencies | No | The paper mentions using ResNet-101 and the ADAM optimizer but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions).
Experiment Setup | Yes | We convert all sentences to lower case and filter rare words that occur less than 5 times, ending up with a vocabulary of 9487 tokens. We use ResNet-101 [He et al., 2016] pre-trained on the ImageNet dataset to extract image features. We do not crop or scale any image. Instead, we take the final convolutional layer of ResNet as image features and apply spatial average pooling, resulting in a feature map of fixed size 14 × 14 × 2048. The hidden state size of the LSTM, the embedding dimension of the input word and the embedding dimension of image features are all fixed to 1000. During training, the parameters are updated by the ADAM optimizer with a learning rate of 5 × 10^-4 and 0.9 as the learning rate decay factor. We let the learning rate decay every 2 epochs and train the model for 30 epochs with a batch size of 16. Following [Yao et al., 2017; Fang et al., 2015], we use the 1000 most frequent words in the training captions as the attribute vocabulary, which cover the majority of words in the training data. To train the inference module, for each image we rank the attributes according to their frequency. For caption generation in the testing stage, we apply the beam search algorithm to boost performance; the beam size is empirically set to 3. (A configuration sketch based on these settings follows the table.)
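
To make the settings quoted in the Experiment Setup row concrete, here is a minimal configuration sketch. It is an assumption-laden illustration only: the paper does not name a deep-learning framework, so PyTorch with torchvision >= 0.13 is assumed, and the decoder components below are generic stand-ins with the reported dimensions, not the authors' attribute-driven attention model.

```python
# Minimal sketch of the reported training configuration (assumptions: PyTorch /
# torchvision >= 0.13; the decoder pieces are generic stand-ins, not the
# authors' attribute-driven attention model).
import torch
import torch.nn as nn
import torchvision

VOCAB_SIZE = 9487        # words occurring at least 5 times
ATTR_VOCAB_SIZE = 1000   # 1000 most frequent training-caption words as attributes
EMBED_SIZE = 1000        # word and image-feature embedding dimension
HIDDEN_SIZE = 1000       # LSTM hidden state size
FEAT_CHANNELS = 2048     # ResNet-101 final conv-layer channels
BEAM_SIZE = 3            # beam search width used at test time

# ImageNet-pretrained ResNet-101; keep the final convolutional feature map
resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
cnn = nn.Sequential(*list(resnet.children())[:-2]).eval()
pool = nn.AdaptiveAvgPool2d((14, 14))   # fixed 14 x 14 x 2048 feature map

# Generic decoder components with the reported dimensions
word_embed = nn.Embedding(VOCAB_SIZE, EMBED_SIZE)
feat_embed = nn.Linear(FEAT_CHANNELS, EMBED_SIZE)
lstm = nn.LSTMCell(2 * EMBED_SIZE, HIDDEN_SIZE)
classifier = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)

params = (list(word_embed.parameters()) + list(feat_embed.parameters())
          + list(lstm.parameters()) + list(classifier.parameters()))

# ADAM, learning rate 5e-4, decayed by 0.9 every 2 epochs; 30 epochs, batch size 16
optimizer = torch.optim.Adam(params, lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)
NUM_EPOCHS, BATCH_SIZE = 30, 16
```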
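
The Open Source Code row points to the third-party MS COCO caption evaluation tool (https://github.com/tylin/coco-caption). Below is a minimal usage sketch of that tool, assuming a Python 3 compatible installation; the annotation and result file paths are hypothetical placeholders.

```python
# Minimal sketch of scoring generated captions with the coco-caption tool
# (assumption: Python 3 compatible install; file paths are placeholders).
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "annotations/captions_val2014.json"  # ground-truth captions
results_file = "results/generated_captions.json"       # model outputs

coco = COCO(annotation_file)
coco_res = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only captioned images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():  # BLEU, METEOR, ROUGE-L, CIDEr
    print(f"{metric}: {score:.3f}")
```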