Stack-Captioning: Coarse-to-Fine Learning for Image Captioning
Authors: Jiuxiang Gu, Jianfei Cai, Gang Wang, Tsuhan Chen
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate the proposed approach on MSCOCO and show that our approach can achieve state-of-the-art performance. |
| Researcher Affiliation | Collaboration | ¹ ROSE Lab, Interdisciplinary Graduate School, Nanyang Technological University, Singapore; ² School of Computer Science and Engineering, Nanyang Technological University, Singapore; ³ Alibaba AI Labs, Hangzhou, China |
| Pseudocode | No | The paper describes the model architecture and equations but does not present any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "We report all the results using MSCOCO caption evaluation tool", with a footnote linking to https://github.com/tylin/coco-caption. This link is for the evaluation tool, not the authors' own source code for their proposed method. |
| Open Datasets | Yes | We evaluate the proposed approach on the MSCOCO dataset. The dataset contains 123,000 images, where each image has five reference captions. We follow the setting of (Karpathy and Fei-Fei 2015) by using 5,000 images for offline validation and 5,000 images for offline testing. |
| Dataset Splits | Yes | We follow the setting of (Karpathy and Fei-Fei 2015) by using 5,000 images for offline validation and 5,000 images for offline testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | In our implementation, the parameters are randomly initialized except the image CNN, for which we encode the full image with the ResNet-101 pre-trained on ImageNet. We first train our model under the cross-entropy cost using the Adam (Kingma and Ba 2015) optimizer with an initial learning rate of 4×10⁻⁴ and a momentum parameter of 0.9. The paper mentions software components such as ResNet-101 and the Adam optimizer, but does not provide version numbers for these or any other libraries. |
| Experiment Setup | Yes | In this paper, we set the number of hidden units of each LSTM to 512, the number of hidden units in the attention layer to 512, and the vocabulary size of the word embedding to 9,487. In our implementation, the parameters are randomly initialized except the image CNN, for which we encode the full image with the ResNet-101 pre-trained on ImageNet. We first train our model under the cross-entropy cost using the Adam (Kingma and Ba 2015) optimizer with an initial learning rate of 4×10⁻⁴ and a momentum parameter of 0.9. After that, we run the proposed RL-based approach on the just-trained model to optimize it for the CIDEr metric. During this stage, we use Adam with a learning rate of 5×10⁻⁵. After each epoch, we evaluate the model on the validation set and select the model with the best CIDEr score for testing. During testing, we apply beam search, which can improve over greedy decoding. Unlike greedy decoding, which keeps only a single hypothesis during decoding, beam search keeps K > 1 (K = 5 in our experiments) hypotheses with the highest scores at each time step and returns the hypothesis with the highest log probability at the end. (Hedged sketches of this two-stage training schedule and of beam-search decoding follow the table.) |
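
The two-stage schedule quoted above (cross-entropy pre-training, then RL fine-tuning toward CIDEr, with per-epoch validation on CIDEr) can be summarized in code. Below is a minimal PyTorch-style sketch, not the authors' unreleased implementation: `model`, the loss callables, and `evaluate_cider` are hypothetical placeholders supplied by the caller, and only the optimizer settings (Adam, lr 4×10⁻⁴ then 5×10⁻⁵, beta1 = 0.9) come from the paper.

```python
# Hedged sketch of the two-stage optimization schedule quoted above.
# All callables passed in (loss functions, CIDEr evaluator) are
# hypothetical placeholders, not the authors' unreleased code.
import copy
import torch

def train_two_stage(model, train_loader, val_loader,
                    xent_loss_fn, rl_cider_loss_fn, evaluate_cider,
                    xent_epochs, rl_epochs):
    # Stage 1: cross-entropy training with Adam, initial lr 4e-4;
    # Adam's beta1 = 0.9 matches the quoted "momentum parameter of 0.9".
    opt = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.9, 0.999))
    for _ in range(xent_epochs):
        for images, captions in train_loader:
            opt.zero_grad()
            xent_loss_fn(model, images, captions).backward()
            opt.step()

    # Stage 2: RL fine-tuning toward the CIDEr metric with Adam, lr 5e-5.
    opt = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
    best_cider = float("-inf")
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(rl_epochs):
        for images, captions in train_loader:
            opt.zero_grad()
            rl_cider_loss_fn(model, images, captions).backward()
            opt.step()
        # After each epoch, keep the checkpoint with the best validation CIDEr.
        cider = evaluate_cider(model, val_loader)
        if cider > best_cider:
            best_cider = cider
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)  # best-CIDEr model is used for testing
    return model
```

The paper does not state epoch counts, so they are left as parameters rather than guessed.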
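The decoding procedure quoted in the last row (keep the K = 5 highest-scoring hypotheses at each time step, return the one with the highest log probability) is standard beam search. A minimal, framework-agnostic sketch follows; `log_probs_fn` is an assumed stand-in for one step of the captioning model's decoder and is not from the paper.

```python
# Minimal beam-search sketch matching the decoding procedure quoted above
# (K = 5 hypotheses per step, highest total log probability wins).
from typing import Callable, List, Tuple

def beam_search(
    log_probs_fn: Callable[[List[int]], List[float]],  # hypothetical decoder step
    bos_id: int,
    eos_id: int,
    beam_size: int = 5,
    max_len: int = 20,
) -> List[int]:
    """Return the token sequence with the highest cumulative log probability."""
    # Each hypothesis is (token_ids, cumulative_log_prob).
    beams: List[Tuple[List[int], float]] = [([bos_id], 0.0)]
    finished: List[Tuple[List[int], float]] = []

    for _ in range(max_len):
        candidates: List[Tuple[List[int], float]] = []
        for tokens, score in beams:
            step_log_probs = log_probs_fn(tokens)  # one log-prob per vocab word
            for word_id, lp in enumerate(step_log_probs):
                candidates.append((tokens + [word_id], score + lp))
        # Keep only the K highest-scoring hypotheses at this time step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))  # hypothesis is complete
            else:
                beams.append((tokens, score))
        if not beams:
            break

    finished.extend(beams)  # fall back to unfinished beams at max_len
    return max(finished, key=lambda c: c[1])[0]
```

In practice `log_probs_fn` would run one step of the LSTM decoder conditioned on the image features and return per-word log probabilities; refinements such as length normalization are common but are not mentioned in the quoted setup.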