Hierarchical Attention Network for Image Captioning
Authors: Weixuan Wang, Zhihong Chen, Haifeng Hu (pp. 8957-8964)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The HAN is verified on the benchmark MSCOCO dataset, and the experimental results indicate that our model outperforms the state-of-the-art methods, achieving a BLEU-1 score of 80.9 and a CIDEr score of 121.7 on the Karpathy test split. |
| Researcher Affiliation | Academia | Weixuan Wang, Zhihong Chen, Haifeng Hu School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510275, China {wangwx25, chenzhh45}@mail2.sysu.edu.cn, huhaif@mail.sysu.edu.cn |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | The MSCOCO dataset (Lin et al. 2014) is the benchmark dataset for image captioning, which contains 82,783, 40,504, and 40,775 images for training, validation, and test respectively. For offline evaluation, we employ the Karpathy splits (Karpathy and Li 2015), which contain 113,287 images for training, 5,000 images for validation, and 5,000 images for test. |
| Dataset Splits | Yes | For offline evaluation, we employ the Karpathy splits (Karpathy and Li 2015), which contain 113,287 images for training, 5,000 images for validation, and 5,000 images for test. |
| Hardware Specification | No | The paper mentions "Due to the limitation of the hardware" but does not specify any hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions using "ResNet-101", "Faster R-CNN", the "Adam optimizer", and "LSTM" but does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | The dimensions of these features are reduced to 512. The dimensions of the embedding layers and both LSTMs are set to 512. Firstly, we train our model under cross-entropy (XE) loss using the Adam optimizer with a learning rate of 5e-4 and do not fine-tune the CNN. Afterwards, we perform CIDEr optimization on the XE-trained model, also using the Adam optimizer. In the decoding process, we use beam search and set the beam size to 3. |
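As a sanity check on the two split rows above, the quoted counts are mutually consistent: the Karpathy splits repartition the official MSCOCO train and validation images (the official test images have no public captions), so the totals must match. A minimal check:

```python
# Official MSCOCO 2014 image counts as quoted in the report.
mscoco = {"train": 82_783, "val": 40_504, "test": 40_775}

# Karpathy splits repartition the official train + val images only.
karpathy = {"train": 113_287, "val": 5_000, "test": 5_000}

# 82,783 + 40,504 = 113,287 + 5,000 + 5,000 = 123,287 images
assert mscoco["train"] + mscoco["val"] == sum(karpathy.values())
```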
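The hyperparameters scattered through the Experiment Setup row can be collected into a single configuration; this is a sketch for reproduction purposes, and the key names are illustrative, as the paper specifies values but no config format.

```python
# Hyperparameters reported in the paper, gathered into one dict.
# Key names are illustrative; only the values come from the paper.
config = {
    "feature_dim": 512,       # image-feature dimension after reduction
    "embedding_dim": 512,     # word-embedding size
    "lstm_dim": 512,          # hidden size of both LSTMs
    "xe_optimizer": "Adam",   # cross-entropy (XE) pretraining
    "xe_learning_rate": 5e-4,
    "finetune_cnn": False,    # CNN frozen during XE training
    "rl_objective": "CIDEr",  # CIDEr optimization after XE training
    "rl_optimizer": "Adam",
    "beam_size": 3,           # beam-search width at decoding
}

# All three model dimensions reported in the paper are 512.
assert config["feature_dim"] == config["embedding_dim"] == config["lstm_dim"] == 512
```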