Text-Guided Attention Model for Image Captioning

Authors: Jonghwan Mun, Minsu Cho, Bohyung Han

AAAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate our model on MSCOCO Captioning benchmark and achieve the state-of-the-art performance in standard metrics." "Experiments: This section describes our experimental setting and presents quantitative and qualitative results of our algorithm in comparison to recent methods."
Researcher Affiliation | Academia | Jonghwan Mun, Minsu Cho, Bohyung Han; Department of Computer Science and Engineering, POSTECH, Korea; {choco1916, mscho, bhhan}@postech.ac.kr
Pseudocode | No | The paper describes its algorithm in prose; no structured pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for the described methodology.
Open Datasets | Yes | "We train our model on MS-COCO dataset (Lin et al. 2014), which contains 123,287 images."
Dataset Splits | Yes | "The images are divided into 82,783 training images and 40,504 validation images. Each split of validation and testing data contains randomly selected 5,000 images from the original validation images." (An illustrative split sketch follows the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments are provided in the paper.
Software Dependencies | No | The paper mentions optimizers and neural network architectures but does not specify version numbers for any key software libraries or dependencies.
Experiment Setup | Yes | "In the decoder, the dimensionalities of the word embedding space and the hidden state of LSTM are set to 512. We use Adam (Kingma and Ba 2015) to learn the model with mini-batch size of 80, where dropouts with 0.5 are applied to the output layer of decoder. The learning rate starts from 0.0004 and after 10 epochs decays by the factor of 0.8 at every three epoch. ... scheduled sampling ... integrated in our learning procedure after 10 epochs with ground-truth word selection probability fixed to 0.75. ... we fix n = 60 and k = 10 in both training and testing." (A configuration sketch follows the table.)
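
The Dataset Splits row describes a common MS-COCO evaluation protocol: two disjoint subsets of 5,000 images each are drawn at random from the 40,504 original validation images to serve as the validation and test sets. The paper does not publish split code, so the following is only a minimal sketch of that procedure; the function name make_eval_splits, the val_image_ids input, and the fixed seed are illustrative assumptions.

```python
import random

def make_eval_splits(val_image_ids, n_val=5000, n_test=5000, seed=0):
    """Draw disjoint validation and test subsets from the original MS-COCO
    validation image IDs, as described in the Dataset Splits row above.
    The seed and the use of random.Random are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = list(val_image_ids)   # copy so the caller's sequence is untouched
    rng.shuffle(shuffled)
    new_val = shuffled[:n_val]
    new_test = shuffled[n_val:n_val + n_test]
    remainder = shuffled[n_val + n_test:]
    return new_val, new_test, remainder

# Dummy IDs standing in for the 40,504 original validation images.
val_ids, test_ids, rest = make_eval_splits(range(40504))
print(len(val_ids), len(test_ids), len(rest))  # 5000 5000 30504
```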
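
The Experiment Setup row quotes enough hyperparameters to outline the decoder and optimization configuration. The PyTorch-style sketch below is a rough illustration under stated assumptions, not the authors' released implementation: the class name CaptionDecoder, the placeholder vocabulary size, and the epoch at which the first learning-rate decay fires are assumptions, and the paper's text-guided attention module is omitted entirely.

```python
import random
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row.
EMBED_DIM = 512    # word embedding dimensionality
HIDDEN_DIM = 512   # LSTM hidden-state dimensionality
DROPOUT_P = 0.5    # dropout on the decoder output layer
BATCH_SIZE = 80    # mini-batch size used with Adam
BASE_LR = 4e-4     # initial learning rate
LR_DECAY = 0.8     # decay factor applied every three epochs after epoch 10
SS_PROB = 0.75     # ground-truth word selection probability (scheduled sampling)

class CaptionDecoder(nn.Module):
    """Minimal LSTM decoder matching the reported dimensionalities; the
    text-guided attention component of the full model is not reproduced."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.dropout = nn.Dropout(DROPOUT_P)
        self.fc = nn.Linear(HIDDEN_DIM, vocab_size)

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embed(tokens), state)
        return self.fc(self.dropout(out)), state

def learning_rate(epoch, base_lr=BASE_LR, decay=LR_DECAY, warmup=10, every=3):
    """Starts at 0.0004 and decays by 0.8 every three epochs after epoch 10;
    the exact epoch of the first decay is an interpretation of the quote."""
    if epoch <= warmup:
        return base_lr
    return base_lr * decay ** ((epoch - warmup) // every)

def next_decoder_input(gt_token, predicted_token, epoch, rng=random):
    """Scheduled sampling after epoch 10: feed the ground-truth word with
    probability 0.75, otherwise the model's own previous prediction."""
    if epoch <= 10 or rng.random() < SS_PROB:
        return gt_token
    return predicted_token

decoder = CaptionDecoder(vocab_size=10000)  # vocabulary size is a placeholder
optimizer = torch.optim.Adam(decoder.parameters(), lr=BASE_LR)
```

In an actual training loop, learning_rate(epoch) would be written into optimizer.param_groups at the start of each epoch, and next_decoder_input would be consulted at every decoding step once the scheduled-sampling phase begins.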