A Multi-task Learning Approach for Image Captioning
Authors: Wei Zhao, Benyou Wang, Jianbo Ye, Min Yang, Zhou Zhao, Ruotian Luo, Yu Qiao
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on MS-COCO dataset demonstrate that our model achieves impressive results compared to other strong competitors. |
| Researcher Affiliation | Collaboration | ¹ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; ² Tencent Inc.; ³ Pennsylvania State University; ⁴ Zhejiang University; ⁵ Toyota Technological Institute at Chicago |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Codes are publicly available at https://goo.gl/iZtCBB. |
| Open Datasets | Yes | In our experiment, we use the widely used MSCOCO 2014 image captions [Karpathy and Fei-Fei, 2015] as our dataset |
| Dataset Splits | Yes | For the off-line testing, we adopt the commonly used Karpathy split [Karpathy and Fei-Fei, 2015], which uses 113,287 images for training, and 5,000 images for validation and testing, respectively. For the on-line server evaluation, our model is trained on 118,287 images and validated on 5,000 images. (A split-loading sketch follows the table.) |
| Hardware Specification | No | The paper does not specify the hardware used (e.g., CPU, GPU models, memory, etc.) for running the experiments. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'EasySRL' with a GitHub link, but does not specify version numbers for these or other software dependencies within the text. |
| Experiment Setup | Yes | We first pre-train our model on the training data with cross-entropy cost, and use Adam optimizer with an initial learning rate 5 × 10⁻⁴ and a momentum parameter of 0.9 to optimize the parameters. After that, we run the proposed RL-based approach on the just trained model, which is directly optimized for the CIDEr metric. During this stage, we use Adam optimizer with learning rate 5 × 10⁻⁵. We set λ₁ = 0.2, λ₂ = 0.7, λ₃ = 0.1. We set the number of hidden units in the Top-Down attention LSTM (LSTM⁽¹⁾) to 1,000, the number of hidden units in the language model LSTM (LSTM⁽²⁾) to 512, the size of the input word embedding to 512, and the size of the CCG supertag embedding to 100. During the decoding stage, we use a beam size of 5 to generate captions. (A hedged configuration sketch of these settings follows the table.) |
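For concreteness, the Karpathy split quoted in the Dataset Splits row can be loaded roughly as follows. This is a minimal sketch, assuming the standard `dataset_coco.json` file released with Karpathy and Fei-Fei (2015), in which each image record carries a `split` field; the file name, the schema, and the convention of folding `restval` images into the training set are assumptions, not details stated in the paper.

```python
# Minimal sketch: group MS-COCO images by the Karpathy split (assumed JSON schema).
import json

with open("dataset_coco.json") as f:  # assumed file name
    images = json.load(f)["images"]

splits = {"train": [], "val": [], "test": []}
for img in images:
    # 'restval' images are conventionally merged into training,
    # which is how the 113,287-image training set is usually obtained.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img["filename"])

print({k: len(v) for k, v in splits.items()})
# Expected sizes per the quoted split: 113,287 train / 5,000 val / 5,000 test
```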
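The training settings quoted in the Experiment Setup row can also be collected into a hedged configuration sketch. The use of PyTorch, the variable names, and the placeholder model below are illustrative assumptions; the paper does not name its framework, and the placeholder only stands in for the actual captioning network.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the captioning model (assumption, illustration only).
model = nn.Linear(512, 512)

config = {
    # Stage 1: cross-entropy pre-training
    "xe_lr": 5e-4,
    "adam_beta1": 0.9,              # "momentum parameter of 0.9"
    # Stage 2: RL fine-tuning, optimized directly for CIDEr
    "rl_lr": 5e-5,
    # Multi-task loss weights
    "lambda1": 0.2, "lambda2": 0.7, "lambda3": 0.1,
    # Model sizes
    "attention_lstm_hidden": 1000,  # Top-Down attention LSTM (LSTM(1))
    "language_lstm_hidden": 512,    # language-model LSTM (LSTM(2))
    "word_embedding_dim": 512,
    "ccg_supertag_embedding_dim": 100,
    # Decoding
    "beam_size": 5,
}

# Stage-1 optimizer (cross-entropy pre-training), then stage-2 optimizer (CIDEr RL).
xe_optimizer = torch.optim.Adam(model.parameters(), lr=config["xe_lr"],
                                betas=(config["adam_beta1"], 0.999))
rl_optimizer = torch.optim.Adam(model.parameters(), lr=config["rl_lr"],
                                betas=(config["adam_beta1"], 0.999))
```

Mapping the quoted "momentum parameter of 0.9" to Adam's β₁ coefficient is an interpretation; the paper does not say which Adam coefficient it refers to.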