Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Authors: Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO. and Section 5, Experiments: We describe our experimental methodology and quantitative results which validate the effectiveness of our model for caption generation. |
| Researcher Affiliation | Academia | Université de Montréal, University of Toronto, CIFAR |
| Pseudocode | No | The paper does not contain explicit pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We make our code for these models publicly available to encourage future research in this area: https://github.com/kelvinxu/arctic-captions |
| Open Datasets | Yes | We report results on the widely-used Flickr8k and Flickr30k datasets as well as the more recently introduced MS COCO dataset. Each image in the Flickr8k/30k datasets has 5 reference captions. In preprocessing our COCO dataset, we maintained the same number of references between our datasets by discarding captions in excess of 5. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000. (Further supported by citations for Flickr8k (Hodosh et al., 2013), Flickr30k (Young et al., 2014) and MS COCO (Lin et al., 2014) in the contributions section; see the preprocessing sketch after this table.) |
| Dataset Splits | Yes | In our reported results, we use the predefined splits of Flickr8k. However, for the Flickr30k and COCO datasets there is a lack of standardized splits for which results are reported. As a result, we report results with the publicly available splits used in previous work (Karpathy & Li, 2014). and We observed a breakdown in correlation between the validation set log-likelihood and BLEU in the later stages of training during our experiments. Since BLEU is the most commonly reported metric, we used BLEU on our validation set for model selection. |
| Hardware Specification | Yes | On our largest dataset (MS COCO), our soft attention model took less than 3 days to train on an NVIDIA Titan Black GPU. |
| Software Dependencies | No | The paper mentions software like 'Theano' and 'Whetlab' and optimization algorithms 'RMSProp' and 'Adam', but it does not specify concrete version numbers for these software dependencies (e.g., 'Theano 0.7' or 'Whetlab 1.0'). |
| Experiment Setup | Yes | Both variants of our attention model were trained with stochastic gradient descent using adaptive learning rates. For the Flickr8k dataset, we found that RMSProp (Tieleman & Hinton, 2012) worked best, while for the Flickr30k/MS COCO datasets we found the recently proposed Adam algorithm (Kingma & Ba, 2014) to be quite effective. and during training we randomly sample a length and retrieve a mini-batch of size 64 of that length. and In addition to dropout (Srivastava et al., 2014), the only other regularization strategy we used was early stopping on BLEU score. and In training the deterministic version of our model, we introduce a form of doubly stochastic regularization that encourages the model to pay equal attention to every part of the image: $L_d = -\log p(\mathbf{y} \mid \mathbf{a}) + \lambda \sum_i \big( \sum_t \alpha_{ti} - \tau \big)^2$ (Eq. 9), where we simply fixed τ to 1. and we used the Oxford VGGnet (Simonyan & Zisserman, 2014) pretrained on ImageNet without finetuning. In our experiments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196×512 (i.e., L×D) encoding. (See the sketch after this table for the feature flattening and the doubly stochastic penalty.) |
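
The preprocessing described in the Open Datasets row (capping each image at 5 reference captions, basic tokenization, and a fixed 10,000-word vocabulary) can be illustrated with a minimal Python sketch. The helper names and the special tokens below are assumptions for illustration only, not the authors' released pipeline.

```python
from collections import Counter

# Minimal sketch of the quoted preprocessing (hypothetical helpers, not the
# authors' actual pipeline): cap each image at 5 reference captions, apply
# basic tokenization, and keep a fixed 10,000-word vocabulary.

MAX_REFS = 5          # discard captions in excess of 5 per COCO image
VOCAB_SIZE = 10_000   # fixed vocabulary size used for all experiments


def tokenize(caption: str) -> list[str]:
    """Basic tokenization: lowercase and split on whitespace."""
    return caption.lower().strip().split()


def build_dataset(captions_per_image: dict[str, list[str]]):
    """captions_per_image maps an image id to its raw reference captions."""
    refs = {img: [tokenize(c) for c in caps[:MAX_REFS]]
            for img, caps in captions_per_image.items()}

    # Count word frequencies over the retained captions.
    counts = Counter(tok for caps in refs.values() for cap in caps for tok in cap)

    # Keep the most frequent words; everything else maps to <unk>.
    # The special tokens here are an assumption for illustration.
    itos = ["<pad>", "<start>", "<end>", "<unk>"] + \
           [w for w, _ in counts.most_common(VOCAB_SIZE - 4)]
    stoi = {w: i for i, w in enumerate(itos)}
    return refs, stoi


if __name__ == "__main__":
    demo = {"img_0": ["A dog runs on grass .", "The dog is running ."] * 3}
    refs, stoi = build_dataset(demo)
    print(len(refs["img_0"]), "references kept;", len(stoi), "vocabulary entries")
```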
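
The Experiment Setup row quotes the doubly stochastic penalty of Eq. (9) and the flattening of the 14×14×512 VGG feature map into a 196×512 annotation matrix. The NumPy sketch below illustrates both under stated assumptions (attention weights α of shape C×L, τ fixed to 1, and a placeholder λ); it is not the authors' Theano implementation.

```python
import numpy as np

# Sketch of two details quoted above (not the authors' Theano code):
# (1) flattening the 14x14x512 conv feature map into the L x D = 196 x 512
#     annotation matrix the decoder attends over, and
# (2) the doubly stochastic penalty of Eq. (9), encouraging the attention mass
#     at each location, summed over the caption, to be roughly tau (= 1).

L, D = 14 * 14, 512   # 196 annotation vectors of dimension 512


def flatten_features(conv_map: np.ndarray) -> np.ndarray:
    """(14, 14, 512) conv feature map -> (196, 512) annotation matrix a."""
    return conv_map.reshape(L, D)


def doubly_stochastic_penalty(alpha: np.ndarray, lam: float = 1.0, tau: float = 1.0) -> float:
    """alpha has shape (C, L): one attention distribution per decoding step t.

    Penalty: lam * sum_i (sum_t alpha_{ti} - tau)^2, which is added to the
    negative log-likelihood to form the regularized loss L_d.
    """
    per_location_mass = alpha.sum(axis=0)             # sum over time steps t
    return lam * float(np.sum((per_location_mass - tau) ** 2))


if __name__ == "__main__":
    conv_map = np.random.rand(14, 14, 512).astype(np.float32)
    a = flatten_features(conv_map)                    # (196, 512)

    C = 13                                            # example caption length
    logits = np.random.randn(C, L)
    alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over L

    print(a.shape, doubly_stochastic_penalty(alpha))
```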