Multi-Level Policy and Reward Reinforcement Learning for Image Captioning

Authors: Anan Liu, Ning Xu, Hanwang Zhang, Weizhi Nie, Yuting Su, Yongdong Zhang

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments and analysis on MSCOCO and Flickr30k show that the proposed framework can achieve competing performances with respect to different evaluation metrics. We perform comprehensive evaluations on MSCOCO and Flickr30k datasets. Our framework achieves the competing performances against state-of-the-art methods. Ablative studies showcase the effect of the proposed framework."
Researcher Affiliation | Academia | (1) School of Electrical and Information Engineering, Tianjin University, Tianjin, China; (2) School of Computer Science and Engineering, Nanyang Technological University, Singapore; (3) University of Science and Technology of China, Hefei, China; liuanan@tju.edu.cn
Pseudocode | No | The paper describes its approach and training process using textual descriptions and mathematical formulations (e.g., Equations 1-9), but it does not include any formally labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing the source code for the described methodology or a direct link to a code repository. It only refers to a third-party evaluation tool: the Microsoft COCO caption evaluation tool (https://github.com/tylin/coco-caption).
Open Datasets | Yes | "We evaluate our framework on captioning datasets: MSCOCO and Flickr30k. For fair comparison, we adopt the splits consistent with [Karpathy and Fei-Fei, 2017]."
Dataset Splits | Yes | "For fair comparison, we adopt the splits consistent with [Karpathy and Fei-Fei, 2017]," which uses 5,000 images each for validation and test on MSCOCO, and 1,000 images each for validation and test on Flickr30k.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing instance types used for running the experiments. It only describes software settings.
Software Dependencies | No | The paper states "All experiments are implemented by PyTorch," but it does not specify the version of PyTorch or of any other software dependency, such as the COCO evaluation tool mentioned above.
Experiment Setup | Yes | As shown in Figure 1, the output of the 2048-d pool5 layer from ResNet-101 is taken as the image feature I. One LSTM unit with 2048-d hidden layers constructs the RNN, and the dimension of both linear mapping layers is set to 2048 → 512. In training, the LSTM hidden, image, word, and attention embedding dimensions are fixed to 512 for the word-level policy. The Adam optimizer is used with an initial learning rate of 5 × 10⁻⁵ and minibatches of size 64. The maximum number of epochs is 30. The margin λ in Eq. 5, β in Eq. 9, and γ in Eq. 4 are set to 0.6, 0.6, and 0.2, respectively. In testing, the beam search width is set to 1.
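Since the paper reports these hyperparameters but releases no code, the setup can be sketched in PyTorch (the framework the authors name). The class and variable names below are hypothetical, and the model is reduced to the word-level policy's core wiring: a 2048 → 512 image mapping, 512-d word embeddings, a 512-d LSTM cell, and an Adam optimizer at 5e-5 with batch size 64.

```python
import torch
import torch.nn as nn

# Dimensions and optimizer settings as reported in the paper.
IMG_DIM = 2048     # ResNet-101 pool5 feature size
EMB_DIM = 512      # hidden / image / word / attention embedding size
LR = 5e-5          # initial Adam learning rate
BATCH = 64         # minibatch size
VOCAB = 10000      # illustrative vocabulary size (not given in the paper)

class WordLevelPolicy(nn.Module):
    """Hypothetical sketch of the word-level policy: image + previous word -> next-word logits."""
    def __init__(self, vocab_size: int = VOCAB):
        super().__init__()
        self.img_embed = nn.Linear(IMG_DIM, EMB_DIM)        # 2048 -> 512 mapping
        self.word_embed = nn.Embedding(vocab_size, EMB_DIM) # 512-d word embedding
        self.lstm = nn.LSTMCell(EMB_DIM, EMB_DIM)           # 512-d LSTM unit
        self.out = nn.Linear(EMB_DIM, vocab_size)           # logits over the vocabulary

    def forward(self, img_feat, word_ids, state):
        # Fuse the mapped image feature with the previous word's embedding at each step.
        x = self.img_embed(img_feat) + self.word_embed(word_ids)
        h, c = self.lstm(x, state)
        return self.out(h), (h, c)

model = WordLevelPolicy()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# One forward step on a random minibatch, just to show the shapes involved.
img = torch.randn(BATCH, IMG_DIM)
words = torch.randint(0, VOCAB, (BATCH,))
state = (torch.zeros(BATCH, EMB_DIM), torch.zeros(BATCH, EMB_DIM))
logits, state = model(img, words, state)
print(logits.shape)  # torch.Size([64, 10000])
```

This is only a structural sketch under the reported settings; the paper's multi-level policies, attention module, and reward terms (Eqs. 4, 5, and 9) are not reproduced here.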