Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Authors: Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO. and Section 5, Experiments: We describe our experimental methodology and quantitative results which validate the effectiveness of our model for caption generation. |
| Researcher Affiliation | Academia | Université de Montréal, University of Toronto, CIFAR |
| Pseudocode | No | The paper does not contain explicit pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We make our code for these models publicly available to encourage future research in this area: https://github.com/kelvinxu/arctic-captions |
| Open Datasets | Yes | We report results on the widely-used Flickr8k and Flickr30k datasets as well as the more recently introduced MS COCO dataset. Each image in the Flickr8k/30k datasets has 5 reference captions. In preprocessing our COCO dataset, we maintained the same number of references between our datasets by discarding captions in excess of 5. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000. (Further supported by citations for Flickr8k (Hodosh et al., 2013), Flickr30k (Young et al., 2014) and MS COCO (Lin et al., 2014) in the contributions section; see the preprocessing sketch after this table.) |
| Dataset Splits | Yes | In our reported results, we use the predefined splits of Flickr8k. However, for the Flickr30k and COCO datasets there is a lack of standardized splits for which results are reported. As a result, we report results with the publicly available splits used in previous work (Karpathy & Li, 2014). and We observed a breakdown in correlation between the validation set log-likelihood and BLEU in the later stages of training during our experiments. Since BLEU is the most commonly reported metric, we used BLEU on our validation set for model selection. |
| Hardware Specification | Yes | On our largest dataset (MS COCO), our soft attention model took less than 3 days to train on an NVIDIA Titan Black GPU. |
| Software Dependencies | No | The paper mentions software like 'Theano' and 'Whetlab' and optimization algorithms 'RMSProp' and 'Adam', but it does not specify concrete version numbers for these software dependencies (e.g., 'Theano 0.7' or 'Whetlab 1.0'). |
| Experiment Setup | Yes | Both variants of our attention model were trained with stochastic gradient descent using adaptive learning rates. For the Flickr8k dataset, we found that RMSProp (Tieleman & Hinton, 2012) worked best, while for the Flickr30k/MS COCO datasets we found the recently proposed Adam algorithm (Kingma & Ba, 2014) to be quite effective. and during training we randomly sample a length and retrieve a mini-batch of size 64 of that length. and In addition to dropout (Srivastava et al., 2014), the only other regularization strategy we used was early stopping on BLEU score. and In training the deterministic version of our model, we introduce a form of doubly stochastic regularization that encourages the model to pay equal attention to every part of the image: $L_d = -\log p(\mathbf{y} \mid \mathbf{a}) + \lambda \sum_i \big( \sum_t \alpha_{ti} - \tau \big)^2$ (Eq. 9), where we simply fixed τ to 1. and we used the Oxford VGGnet (Simonyan & Zisserman, 2014) pretrained on ImageNet without finetuning. In our experiments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196×512 (i.e., L×D) encoding. (See the sketch after this table for the feature flattening and the doubly stochastic penalty.) |
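
The preprocessing described in the Open Datasets row (capping each image at 5 reference captions, basic tokenization, and a fixed 10,000-word vocabulary) can be illustrated with a minimal Python sketch. The helper names and the special tokens below are assumptions for illustration only, not the authors' released pipeline.

```python
from collections import Counter

# Minimal sketch of the quoted preprocessing (hypothetical helpers, not the
# authors' actual pipeline): cap each image at 5 reference captions, apply
# basic tokenization, and keep a fixed 10,000-word vocabulary.

MAX_REFS = 5          # discard captions in excess of 5 per COCO image
VOCAB_SIZE = 10_000   # fixed vocabulary size used for all experiments


def tokenize(caption: str) -> list[str]:
    """Basic tokenization: lowercase and split on whitespace."""
    return caption.lower().strip().split()


def build_dataset(captions_per_image: dict[str, list[str]]):
    """captions_per_image maps an image id to its raw reference captions."""
    refs = {img: [tokenize(c) for c in caps[:MAX_REFS]]
            for img, caps in captions_per_image.items()}

    # Count word frequencies over the retained captions.
    counts = Counter(tok for caps in refs.values() for cap in caps for tok in cap)

    # Keep the most frequent words; everything else maps to <unk>.
    # The special tokens here are an assumption for illustration.
    itos = ["<pad>", "<start>", "<end>", "<unk>"] + \
           [w for w, _ in counts.most_common(VOCAB_SIZE - 4)]
    stoi = {w: i for i, w in enumerate(itos)}
    return refs, stoi


if __name__ == "__main__":
    demo = {"img_0": ["A dog runs on grass .", "The dog is running ."] * 3}
    refs, stoi = build_dataset(demo)
    print(len(refs["img_0"]), "references kept;", len(stoi), "vocabulary entries")
```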
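
The Experiment Setup row quotes the doubly stochastic penalty of Eq. (9) and the flattening of the 14×14×512 VGG feature map into a 196×512 annotation matrix. The NumPy sketch below illustrates both under stated assumptions (attention weights α of shape C×L, τ fixed to 1, and a placeholder λ); it is not the authors' Theano implementation.

```python
import numpy as np

# Sketch of two details quoted above (not the authors' Theano code):
# (1) flattening the 14x14x512 conv feature map into the L x D = 196 x 512
#     annotation matrix the decoder attends over, and
# (2) the doubly stochastic penalty of Eq. (9), encouraging the attention mass
#     at each location, summed over the caption, to be roughly tau (= 1).

L, D = 14 * 14, 512   # 196 annotation vectors of dimension 512


def flatten_features(conv_map: np.ndarray) -> np.ndarray:
    """(14, 14, 512) conv feature map -> (196, 512) annotation matrix a."""
    return conv_map.reshape(L, D)


def doubly_stochastic_penalty(alpha: np.ndarray, lam: float = 1.0, tau: float = 1.0) -> float:
    """alpha has shape (C, L): one attention distribution per decoding step t.

    Penalty: lam * sum_i (sum_t alpha_{ti} - tau)^2, which is added to the
    negative log-likelihood to form the regularized loss L_d.
    """
    per_location_mass = alpha.sum(axis=0)             # sum over time steps t
    return lam * float(np.sum((per_location_mass - tau) ** 2))


if __name__ == "__main__":
    conv_map = np.random.rand(14, 14, 512).astype(np.float32)
    a = flatten_features(conv_map)                    # (196, 512)

    C = 13                                            # example caption length
    logits = np.random.randn(C, L)
    alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over L

    print(a.shape, doubly_stochastic_penalty(alpha))
```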