Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning

Authors: Hui Chen, Guiguang Ding, Zijia Lin, Sicheng Zhao, Jungong Han

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify the effectiveness and superiority of the proposed approach over other captioning approaches by conducting massive experiments and comparisons on the MS COCO image captioning dataset.
Researcher Affiliation | Collaboration | Beijing National Research Center for Information Science and Technology (BNRist), School of Software, Tsinghua University, Beijing, China; School of Computing & Communications, Lancaster University, UK; Microsoft Research, Beijing, China. Emails: {jichenhui2012,jungonghan77,schzhao}@gmail.com, dinggg@tsinghua.edu.cn, zijlin@microsoft.com
Pseudocode | No | The paper describes the model architecture and mathematical formulations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions 'https://github.com/tylin/coco-caption' for the MS COCO caption evaluation tool, which is a third-party tool, but does not provide a link or statement about the availability of their own source code. (A usage sketch for the evaluation tool follows the table.)
Open Datasets | Yes | Following previous works [Yao et al., 2017; Yang et al., 2016], we conduct experiments on the popular MS COCO dataset [Lin et al., 2014], which consists of 82783 training images and 40504 validation images.
Dataset Splits | Yes | For offline evaluation, we follow most previous works [Chen et al., 2017a; Yao et al., 2017] and split the 123287 images into three parts: 5000 for validation, 5000 for test and the remainder for training.
Hardware Specification | No | The paper discusses the models and optimizers used (ResNet-101, ADAM optimizer) but does not provide specific hardware details like GPU or CPU models used for training or inference.
Software Dependencies | No | The paper mentions using ResNet-101 and the ADAM optimizer but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions).
Experiment Setup | Yes | We convert all sentences to lower case and filter rare words that occur less than 5 times, ending up with a vocabulary of 9487 tokens. We use ResNet-101 [He et al., 2016] pre-trained on the ImageNet dataset to extract image features. We do not crop or scale any image. Instead, we take the final convolutional layer of ResNet as image features and apply spatial average pooling, resulting in a feature map of fixed size 14 × 14 × 2048. The hidden state size of the LSTM, the embedding dimension of the input word and the embedding dimension of image features are all fixed to 1000. During training, the parameters are updated by the ADAM optimizer with a learning rate of 5 × 10^-4 and 0.9 as the learning rate decay factor. We let the learning rate decay every 2 epochs and train the model for 30 epochs with a batch size of 16. Following [Yao et al., 2017; Fang et al., 2015], we use the 1000 most frequent words in the training captions as the attribute vocabulary, which cover the majority of words in the training data. To train the inference module, for each image we rank the attributes according to their frequency. For caption generation in the testing stage, we apply the beam search algorithm to boost performance; the beam size is empirically set to 3. (A configuration sketch based on these settings follows the table.)
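
To make the settings quoted in the Experiment Setup row concrete, here is a minimal configuration sketch. It is an assumption-laden illustration only: the paper does not name a deep-learning framework, so PyTorch with torchvision >= 0.13 is assumed, and the decoder components below are generic stand-ins with the reported dimensions, not the authors' attribute-driven attention model.

```python
# Minimal sketch of the reported training configuration (assumptions: PyTorch /
# torchvision >= 0.13; the decoder pieces are generic stand-ins, not the
# authors' attribute-driven attention model).
import torch
import torch.nn as nn
import torchvision

VOCAB_SIZE = 9487        # words occurring at least 5 times
ATTR_VOCAB_SIZE = 1000   # 1000 most frequent training-caption words as attributes
EMBED_SIZE = 1000        # word and image-feature embedding dimension
HIDDEN_SIZE = 1000       # LSTM hidden state size
FEAT_CHANNELS = 2048     # ResNet-101 final conv-layer channels
BEAM_SIZE = 3            # beam search width used at test time

# ImageNet-pretrained ResNet-101; keep the final convolutional feature map
resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
cnn = nn.Sequential(*list(resnet.children())[:-2]).eval()
pool = nn.AdaptiveAvgPool2d((14, 14))   # fixed 14 x 14 x 2048 feature map

# Generic decoder components with the reported dimensions
word_embed = nn.Embedding(VOCAB_SIZE, EMBED_SIZE)
feat_embed = nn.Linear(FEAT_CHANNELS, EMBED_SIZE)
lstm = nn.LSTMCell(2 * EMBED_SIZE, HIDDEN_SIZE)
classifier = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)

params = (list(word_embed.parameters()) + list(feat_embed.parameters())
          + list(lstm.parameters()) + list(classifier.parameters()))

# ADAM, learning rate 5e-4, decayed by 0.9 every 2 epochs; 30 epochs, batch size 16
optimizer = torch.optim.Adam(params, lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)
NUM_EPOCHS, BATCH_SIZE = 30, 16
```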
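
The Open Source Code row points to the third-party MS COCO caption evaluation tool (https://github.com/tylin/coco-caption). Below is a minimal usage sketch of that tool, assuming a Python 3 compatible installation; the annotation and result file paths are hypothetical placeholders.

```python
# Minimal sketch of scoring generated captions with the coco-caption tool
# (assumption: Python 3 compatible install; file paths are placeholders).
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "annotations/captions_val2014.json"  # ground-truth captions
results_file = "results/generated_captions.json"       # model outputs

coco = COCO(annotation_file)
coco_res = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only captioned images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():  # BLEU, METEOR, ROUGE-L, CIDEr
    print(f"{metric}: {score:.3f}")
```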