Temporal-Difference Learning With Sampling Baseline for Image Captioning

Authors: Hui Chen, Guiguang Ding, Sicheng Zhao, Jungong Han

Venue: AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that our proposed method can improve the quality of generated captions and outperforms the state-of-the-art methods on the benchmark dataset MS COCO in terms of seven evaluation metrics. We conduct extensive experiments and comparisons with other methods. The results demonstrate that the proposed method has a significant superiority over the state-of-the-art methods.
Researcher Affiliation | Academia | School of Software, Tsinghua University, Beijing 100084, China; School of Computing and Communications, Lancaster University, Lancaster, LA1 4YW, UK; {jichenhui2012,schzhao,jungonghan77}@gmail.com, dinggg@tsinghua.edu.cn
Pseudocode | No | The paper describes the method using mathematical equations and textual explanations, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper states 'We use the code publicly to preprocess the dataset', with a footnote pointing to https://github.com/karpathy/neuraltalk. This refers to third-party preprocessing code, not the authors' own source code for the method described in the paper.
Open Datasets | Yes | We evaluate our proposed method on the popular MS COCO dataset (Lin et al. 2014).
Dataset Splits | Yes | The MS COCO dataset contains 123,287 images labeled with at least 5 captions each, comprising 82,783 training images and 40,504 validation images; MS COCO also provides 40,775 images as a test set for online evaluation. Since the standard test set is not public, we use 5,000 images for validation, 5,000 images for test, and the remaining images for training, as in previous works (Xu et al. 2015; You et al. 2016; Chen et al. 2017c) for offline evaluation (see the split sketch below the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or processing units) used for running its experiments.
Software Dependencies | No | The paper mentions software components like ResNet-101, the ADAM optimizer, and LSTM, but does not provide specific version numbers for these or any other software dependencies, such as programming languages or libraries.
Experiment Setup | Yes | We train models under the XENT loss using the ADAM optimizer with a learning rate of 5×10^-4 and finetune the CNN from the beginning. For all models, the batch size is set to 16, and model evaluation is performed every 1K iterations during training. When training models under the RL loss, the learning rate for the language model is initialized to 1×10^-4, set to 5×10^-5 after 50K iterations, and then decreased by 1×10^-5 every 100K iterations until it reaches 1×10^-5. By default, the beam search size is fixed to 3 for all models at test time (see the schedule sketch below the table).
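
A minimal Python sketch of the offline split quoted in the Dataset Splits row, assuming the common practice of pooling the official MS COCO train2014 and val2014 images and carving out 5,000 validation and 5,000 test images; the function name, the random seed, and the shuffling step are illustrative assumptions, not the authors' released code.

```python
import random

# Hedged sketch of the offline "Karpathy-style" split described above:
# 5,000 validation images, 5,000 test images, and the remaining images
# from the pooled MS COCO train2014 + val2014 sets used for training.
def make_offline_split(coco_image_ids, seed=123):
    """coco_image_ids: iterable of image ids from the MS COCO train2014 + val2014 sets."""
    rng = random.Random(seed)
    images = list(coco_image_ids)
    rng.shuffle(images)  # whether/how the images are shuffled is an assumption
    val_ids = images[:5000]
    test_ids = images[5000:10000]
    train_ids = images[10000:]
    return {"train": train_ids, "val": val_ids, "test": test_ids}

if __name__ == "__main__":
    # Example with dummy ids; in practice these would come from the COCO annotations
    # (82,783 train + 40,504 val = 123,287 images).
    split = make_offline_split(range(123287))
    print({name: len(ids) for name, ids in split.items()})
```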
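
And a hedged sketch of the optimization schedule quoted in the Experiment Setup row. The constants mirror the quoted values; the `rl_learning_rate` helper is an assumption about how the quoted RL-phase numbers combine (in particular, when each 100K-iteration decrement begins), not the authors' implementation.

```python
# Constants taken from the quoted experiment setup.
XENT_LR = 5e-4      # ADAM learning rate for cross-entropy (XENT) training
BATCH_SIZE = 16     # batch size for all models
EVAL_EVERY = 1000   # model evaluation every 1K training iterations
BEAM_SIZE = 3       # beam search width used at test time

def rl_learning_rate(iteration):
    """Assumed language-model learning rate during RL training.

    Starts at 1e-4, drops to 5e-5 after 50K iterations, then decreases by
    1e-5 every further 100K iterations, with a floor of 1e-5.
    """
    if iteration < 50_000:
        return 1e-4
    lr = 5e-5 - 1e-5 * ((iteration - 50_000) // 100_000)
    return max(lr, 1e-5)

if __name__ == "__main__":
    # Print the assumed learning rate at a few milestone iterations so the
    # schedule can be checked against the paper's description.
    for it in (0, 49_999, 50_000, 150_000, 450_000):
        print(it, rl_learning_rate(it))
```

Running the `__main__` block prints the assumed learning rate at a few milestone iterations, which makes the schedule easy to compare against the quoted description.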