Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Authors: Jiuxiang Gu, Jianfei Cai, Gang Wang, Tsuhan Chen

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We extensively evaluate the proposed approach on MSCOCO and show that our approach can achieve the state-of-the-art performance."
Researcher Affiliation | Collaboration | 1) ROSE Lab, Interdisciplinary Graduate School, Nanyang Technological University, Singapore; 2) School of Computer Science and Engineering, Nanyang Technological University, Singapore; 3) Alibaba AI Labs, Hangzhou, China
Pseudocode | No | The paper describes the model architecture and equations but does not present any pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "We report all the results using MSCOCO caption evaluation tool" (footnote: https://github.com/tylin/coco-caption). This link points to the evaluation tool, not to the authors' own source code for the proposed method.
Open Datasets | Yes | "We evaluate the proposed approach on MSCOCO dataset. The dataset contains 123,000 images, where each image has five reference captions. We follow the setting of (Karpathy and Fei-Fei 2015) by using 5,000 images for offline validation and 5,000 images for offline testing."
Dataset Splits | Yes | "We follow the setting of (Karpathy and Fei-Fei 2015) by using 5,000 images for offline validation and 5,000 images for offline testing."
Hardware Specification | No | The paper does not provide hardware details (e.g., GPU or CPU models) used to run the experiments.
Software Dependencies | No | The paper names software components such as the ResNet-101 image CNN pre-trained on ImageNet and the Adam optimizer (Kingma and Ba 2015), but it does not give version numbers for these or for any other libraries.
Experiment Setup | Yes | "In this paper, we set the number of hidden units of each LSTM to 512, the number of hidden units in the attention layer to 512, and the vocabulary size of the word embedding to 9,487. In our implementation, the parameters are randomly initialized except the image CNN, for which we encode the full image with the ResNet-101 pre-trained on ImageNet. We first train our model under the cross-entropy cost using the Adam (Kingma and Ba 2015) optimizer with an initial learning rate of 4 × 10⁻⁴ and a momentum parameter of 0.9. After that, we run the proposed RL-based approach on the just-trained model to optimize it for the CIDEr metric. During this stage, we use Adam with a learning rate of 5 × 10⁻⁵. After each epoch, we evaluate the model on the validation set and select the model with the best CIDEr score for testing. During testing, we apply beam search, which can improve over greedy decoding. Unlike greedy decoding, which keeps only a single hypothesis during decoding, beam search keeps K > 1 hypotheses (K = 5 in our experiments) that have the highest scores at each time step, and returns the hypothesis with the highest log probability at the end."
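For concreteness, the hyperparameters quoted in the Experiment Setup row can be restated as a single configuration object. This is only a summary sketch, not the authors' code (none was released); every field name below (lstm_hidden_size, xe_learning_rate, and so on) is hypothetical, and the values are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass
class StackCaptioningConfig:
    """Hypothetical summary of the training setup reported in the paper."""
    # Model
    cnn_backbone: str = "resnet101"    # image CNN pre-trained on ImageNet
    lstm_hidden_size: int = 512        # hidden units of each LSTM
    attention_hidden_size: int = 512   # hidden units in the attention layer
    vocab_size: int = 9487             # word-embedding vocabulary size

    # Stage 1: cross-entropy training
    xe_optimizer: str = "adam"
    xe_learning_rate: float = 4e-4
    xe_momentum: float = 0.9           # "momentum parameter" as stated in the paper

    # Stage 2: RL fine-tuning for CIDEr
    rl_optimizer: str = "adam"
    rl_learning_rate: float = 5e-5
    rl_reward_metric: str = "CIDEr"

    # Decoding
    beam_size: int = 5
```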
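The beam-search procedure described in the same quote (keep the K = 5 highest-scoring partial captions at each time step, return the finished hypothesis with the highest log probability) can be sketched generically as follows. This is an illustration, not the authors' implementation; `step_log_probs` is a hypothetical stand-in for one decoding step of the caption LSTM.

```python
def beam_search(step_log_probs, bos, eos, beam_size=5, max_len=20):
    """Generic beam-search decoding as described in the quoted setup.

    step_log_probs(prefix) is a hypothetical stand-in for one decoder step;
    it must return a dict {next_token_id: log_probability} for the given prefix.
    """
    # Each hypothesis is (token sequence, cumulative log probability).
    beams = [([bos], 0.0)]

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                # A finished hypothesis is carried over unchanged.
                candidates.append((seq, score))
            else:
                for tok, logp in step_log_probs(seq).items():
                    candidates.append((seq + [tok], score + logp))
        # Keep only the K = beam_size highest-scoring hypotheses at this step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break

    # Return the hypothesis with the highest log probability at the end.
    return max(beams, key=lambda c: c[1])[0]
```

With a trained decoder, `step_log_probs` would wrap one LSTM step followed by a log-softmax over the 9,487-word vocabulary described above.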
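Finally, the Open Source Code row notes that the only linked repository is the MSCOCO caption evaluation tool (https://github.com/tylin/coco-caption). A hedged sketch of how scores are typically computed with that tool is shown below; the import paths follow that repository's layout but may differ across forks and Python versions, and the annotation/result file names are placeholders.

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder paths: reference captions and model-generated captions.
coco = COCO("annotations/captions_val2014.json")
coco_res = coco.loadRes("results/captions_results.json")

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only captioned images
coco_eval.evaluate()

# Prints BLEU, METEOR, ROUGE-L, and CIDEr scores.
for metric, score in coco_eval.eval.items():
    print(metric, round(score, 3))
```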