Hierarchical Attention Network for Image Captioning

Authors: Weixuan Wang, Zhihong Chen, Haifeng Hu (pp. 8957-8964)

AAAI 2019

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | The HAN is verified on the benchmark MSCOCO dataset, and the experimental results indicate that our model outperforms the state-of-the-art methods, achieving a BLEU-1 score of 80.9 and a CIDEr score of 121.7 on the Karpathy test split.
Researcher Affiliation | Academia | Weixuan Wang, Zhihong Chen, Haifeng Hu; School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510275, China. {wangwx25, chenzhh45}@mail2.sysu.edu.cn, huhaif@mail.sysu.edu.cn
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | The MSCOCO dataset (Lin et al. 2014) is the benchmark dataset for image captioning; it contains 82,783, 40,504, and 40,775 images for training, validation, and test respectively. For offline evaluation, we employ the Karpathy splits (Karpathy and Li 2015), which contain 113,287 images for training, 5,000 images for validation, and 5,000 images for test.
Dataset Splits | Yes | For offline evaluation, we employ the Karpathy splits (Karpathy and Li 2015), which contain 113,287 images for training, 5,000 images for validation, and 5,000 images for test.
Hardware Specification | No | The paper mentions "Due to the limitation of the hardware" but does not specify any hardware details such as GPU/CPU models or memory.
Software Dependencies | No | The paper mentions using ResNet-101, Faster R-CNN, the Adam optimizer, and LSTMs, but does not provide specific version numbers for any software components or libraries.
Experiment Setup | Yes | The dimensions of these features are reduced to 512. The dimensions of the embedding layers and both LSTMs are set to 512. First, we train our model under cross-entropy (XE) loss using the Adam optimizer with a learning rate of 5e-4 and do not finetune the CNN. Afterwards, we perform CIDEr optimization on the XE-trained model, also using the Adam optimizer. In the decoding process, we use beam search with a beam size of 3.
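The reported hyperparameters can be collected into a single configuration object for anyone attempting a reproduction. This is a minimal sketch; the class and field names are illustrative assumptions, not taken from any released codebase (the paper publishes none), and only the numeric values come from the paper.

```python
from dataclasses import dataclass


@dataclass
class HANTrainConfig:
    """Hypothetical config mirroring the hyperparameters reported in the paper."""
    feature_dim: int = 512          # image features are reduced to 512 dimensions
    embed_dim: int = 512            # embedding-layer dimension
    lstm_dim: int = 512             # hidden size of both LSTMs
    xe_learning_rate: float = 5e-4  # Adam learning rate for cross-entropy training
    finetune_cnn: bool = False      # the CNN is not finetuned during XE training
    rl_optimizer: str = "adam"      # CIDEr optimization also uses Adam
    beam_size: int = 3              # beam search width at decoding time


cfg = HANTrainConfig()
print(cfg.feature_dim, cfg.xe_learning_rate, cfg.beam_size)
```

Unspecified details (batch size, number of epochs, learning-rate schedule for the CIDEr stage) are deliberately omitted, since the paper does not report them.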