Show, Recall, and Tell: Image Captioning with Recall Mechanism

Authors: Li Wang, Zechen Bai, Yonghua Zhang, Hongtao Lu

Venue: AAAI 2020, pp. 12176-12183

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed methods (SG+RWS+WR) achieve BLEU-4 / CIDEr / SPICE scores of 36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr optimization on the MSCOCO Karpathy test split, which surpass the results of other state-of-the-art methods.
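
The BLEU-4 and CIDEr numbers above are standard COCO caption metrics. As a point of reference only (not the authors' evaluation code), the sketch below shows how such scores are typically computed with the pycocoevalcap toolkit; the image id and caption strings are hypothetical placeholders, and a real pipeline would first run PTBTokenizer over the captions (SPICE additionally needs a Java runtime).

```python
# Minimal sketch, not the authors' code: scoring candidate captions against
# references with pycocoevalcap (the coco-caption metrics package).
# Both scorers expect dicts mapping an image id to a list of caption strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

references = {  # hypothetical ground-truth captions
    "391895": ["a man riding a motorcycle on a dirt road",
               "a person rides a motorbike down a country path"],
}
candidates = {  # hypothetical model output (one caption per image)
    "391895": ["a man riding a motorcycle down a dirt road"],
}

bleu, _ = Bleu(4).compute_score(references, candidates)   # BLEU-1..BLEU-4
cider, _ = Cider().compute_score(references, candidates)
print("BLEU-4:", bleu[3])
print("CIDEr :", cider)
```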
Researcher Affiliation | Collaboration | (1) Department of Computer Science and Engineering, Shanghai Jiao Tong University, China; (2) Institute of Software, Chinese Academy of Sciences, China; (3) AI-Lab Visual Search Team, Bytedance
Pseudocode | No | The paper describes its methods using mathematical formulas and descriptive text, but it does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about making its source code publicly available or a link to a code repository.
Open Datasets | Yes | MSCOCO: We use the MSCOCO 2014 captions dataset (Lin et al. 2014) to evaluate our proposed method.
Dataset Splits | Yes | In this paper, we employ the Karpathy splits (Karpathy and Fei-Fei 2015) for validation of model hyperparameters and offline evaluation. This split has been widely used in prior works, choosing 113,287 images with five captions each for training and 5,000 images each for validation and test.
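
For context, the Karpathy split is usually applied from the publicly shared split file rather than recomputed. The sketch below assumes that file (commonly distributed as dataset_coco.json) and its field names; neither is an artifact released with this paper. Folding the "restval" images into training is what yields the 113,287 / 5,000 / 5,000 partition quoted above.

```python
# Minimal sketch, assuming the widely shared Karpathy split file
# "dataset_coco.json" (not released with this paper). Each image entry has a
# "split" field; merging "restval" into "train" gives 113,287 training images
# plus 5,000 each for validation and test.
import json
from collections import defaultdict

with open("dataset_coco.json") as f:
    coco = json.load(f)

splits = defaultdict(list)
for img in coco["images"]:
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img["filename"])

for name in ("train", "val", "test"):
    print(name, len(splits[name]))
```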
Hardware Specification | Yes | GeForce GTX 1080Ti is the GPU we employed in all experiments.
Software Dependencies | No | The paper mentions software components like 'word2vec', 'Adam optimizer', and 'Bi-LSTM', but does not provide specific version numbers for these or other relevant software dependencies or frameworks.
Experiment Setup | Yes | In our text-retrieval module, ... During training, the batch size is set to 128 and the margin α is set to 0.2. The learning rate is set to 5e-4 and decays by a factor of 0.8 every 3 epochs. In the captioning model, ... The hidden units of LSTM1 and LSTM2 are both 1024, and the size of the word embedding is also 1024. We adopt the Adam optimizer with the learning rate set to 5e-4, also decaying by a factor of 0.8 every 3 epochs, and the batch size is set to 64. For CIDEr optimization training, we initialize the learning rate as 5e-5, decaying by a factor of 0.1 every 50 epochs.
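
As an illustration of the captioning-model optimization schedule quoted above (Adam at 5e-4, decayed by a factor of 0.8 every 3 epochs, batch size 64, 1024-d embeddings and LSTM hidden states), here is a minimal PyTorch sketch. The toy decoder, vocabulary size, and random batches are hypothetical stand-ins; the paper's actual model uses two attention-equipped LSTMs plus the recall mechanism, which this sketch does not reproduce.

```python
# Minimal PyTorch sketch of the training schedule only; ToyDecoder, the
# vocabulary size, and the random batches are hypothetical placeholders.
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_size=1024, hidden_size=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)  # 1024-d word embedding
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.fc(h)

model = ToyDecoder()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)              # lr 5e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)

for epoch in range(9):
    tokens = torch.randint(0, 10000, (64, 20))                         # batch size 64, dummy captions
    logits = model(tokens[:, :-1])                                     # predict next token
    loss = criterion(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                                   # lr *= 0.8 every 3rd epoch
```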