A Multi-task Learning Approach for Image Captioning
Authors: Wei Zhao, Benyou Wang, Jianbo Ye, Min Yang, Zhou Zhao, Ruotian Luo, Yu Qiao
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on MS-COCO dataset demonstrate that our model achieves impressive results compared to other strong competitors. |
| Researcher Affiliation | Collaboration | ¹ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; ² Tencent Inc.; ³ Pennsylvania State University; ⁴ Zhejiang University; ⁵ Toyota Technological Institute at Chicago |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Codes are publicly available at https://goo.gl/iZtCBB. |
| Open Datasets | Yes | In our experiment, we use the widely used MSCOCO 2014 image captions [Karpathy and Fei-Fei, 2015] as our dataset |
| Dataset Splits | Yes | For the off-line testing, we adopt the commonly used Karpathy split [Karpathy and Fei-Fei, 2015], which uses 113,287 images for training, and 5,000 images for validation and testing, respectively. For the on-line server evaluation, our model is trained on 118,287 images and validated on 5,000 images. (A split-loading sketch follows the table.) |
| Hardware Specification | No | The paper does not specify the hardware used (e.g., CPU, GPU models, memory, etc.) for running the experiments. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'EasySRL' with a GitHub link, but does not specify version numbers for these or other software dependencies within the text. |
| Experiment Setup | Yes | We first pre-train our model on the training data with cross-entropy cost, and use Adam optimizer with an initial learning rate 5 × 10⁻⁴ and a momentum parameter of 0.9 to optimize the parameters. After that, we run the proposed RL-based approach on the just trained model, which is directly optimized for the CIDEr metric. During this stage, we use Adam optimizer with learning rate 5 × 10⁻⁵. We set λ₁ = 0.2, λ₂ = 0.7, λ₃ = 0.1. We set the number of hidden units in the Top-Down attention LSTM (LSTM⁽¹⁾) to 1,000, the number of hidden units in the language model LSTM (LSTM⁽²⁾) to 512, the size of the input word embedding to 512, and the size of the CCG supertag embedding to 100. During the decoding stage, we use a beam size of 5 to generate captions. (A hedged configuration sketch of these settings follows the table.) |
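For concreteness, the Karpathy split quoted in the Dataset Splits row can be loaded roughly as follows. This is a minimal sketch, assuming the standard `dataset_coco.json` file released with Karpathy and Fei-Fei (2015), in which each image record carries a `split` field; the file name, the schema, and the convention of folding `restval` images into the training set are assumptions, not details stated in the paper.

```python
# Minimal sketch: group MS-COCO images by the Karpathy split (assumed JSON schema).
import json

with open("dataset_coco.json") as f:  # assumed file name
    images = json.load(f)["images"]

splits = {"train": [], "val": [], "test": []}
for img in images:
    # 'restval' images are conventionally merged into training,
    # which is how the 113,287-image training set is usually obtained.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img["filename"])

print({k: len(v) for k, v in splits.items()})
# Expected sizes per the quoted split: 113,287 train / 5,000 val / 5,000 test
```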
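The training settings quoted in the Experiment Setup row can also be collected into a hedged configuration sketch. The use of PyTorch, the variable names, and the placeholder model below are illustrative assumptions; the paper does not name its framework, and the placeholder only stands in for the actual captioning network.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the captioning model (assumption, illustration only).
model = nn.Linear(512, 512)

config = {
    # Stage 1: cross-entropy pre-training
    "xe_lr": 5e-4,
    "adam_beta1": 0.9,              # "momentum parameter of 0.9"
    # Stage 2: RL fine-tuning, optimized directly for CIDEr
    "rl_lr": 5e-5,
    # Multi-task loss weights
    "lambda1": 0.2, "lambda2": 0.7, "lambda3": 0.1,
    # Model sizes
    "attention_lstm_hidden": 1000,  # Top-Down attention LSTM (LSTM(1))
    "language_lstm_hidden": 512,    # language-model LSTM (LSTM(2))
    "word_embedding_dim": 512,
    "ccg_supertag_embedding_dim": 100,
    # Decoding
    "beam_size": 5,
}

# Stage-1 optimizer (cross-entropy pre-training), then stage-2 optimizer (CIDEr RL).
xe_optimizer = torch.optim.Adam(model.parameters(), lr=config["xe_lr"],
                                betas=(config["adam_beta1"], 0.999))
rl_optimizer = torch.optim.Adam(model.parameters(), lr=config["rl_lr"],
                                betas=(config["adam_beta1"], 0.999))
```

Mapping the quoted "momentum parameter of 0.9" to Adam's β₁ coefficient is an interpretation; the paper does not say which Adam coefficient it refers to.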