Show, Recall, and Tell: Image Captioning with Recall Mechanism
Authors: Li Wang, Zechen Bai, Yonghua Zhang, Hongtao Lu
AAAI 2020, pp. 12176-12183 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed methods (SG+RWS+WR) achieve BLEU-4 / CIDEr / SPICE scores of 36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr optimization on MSCOCO Karpathy test split, which surpass the results of other state-of-the-art methods. |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science and Engineering, Shanghai Jiao Tong University, China; ²Institute of Software, Chinese Academy of Sciences, China; ³AI-Lab Visual Search Team, Bytedance |
| Pseudocode | No | The paper describes its methods using mathematical formulas and descriptive text, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about making its source code publicly available or a link to a code repository. |
| Open Datasets | Yes | MSCOCO. We use the MSCOCO 2014 captions dataset (Lin et al. 2014) to evaluate our proposed method. |
| Dataset Splits | Yes | In this paper, we employ the Karpathy splits (Karpathy and Fei-Fei 2015) for validation of model hyperparameters and offline evaluation. This split has been widely used in prior works, choosing 113,287 images with five captions each for training, and 5,000 images each for validation and testing. |
| Hardware Specification | Yes | GeForce GTX 1080Ti is the GPU we employed in all experiments. |
| Software Dependencies | No | The paper mentions software components like 'word2vec', 'Adam optimizer', and 'Bi-LSTM', but does not provide specific version numbers for these or other relevant software dependencies or frameworks. |
| Experiment Setup | Yes | In our text-retrieval module, ... During training, the batch size is set to 128, and the margin α is set to 0.2. The learning rate is set to 5e-4 and decay by a factor 0.8 for every 3 epochs. In captioning model, ... The hidden units of LSTM1 and LSTM2 are both 1024, and the size of word embedding is also 1024. We adopt Adam optimizer with the learning rate set as 5e-4 and decay also by a factor 0.8 for every 3 epochs, and the batch size is set to 64. For CIDEr optimization training, we initialize the learning rate as 5e-5, decaying by a factor 0.1 for every 50 epochs. |
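
The Experiment Setup row reports two training stages for the captioning model (cross-entropy training followed by CIDEr optimization), each with its own learning rate and decay schedule. The sketch below shows how those reported values map onto a standard optimizer-plus-scheduler configuration. It is a minimal PyTorch-style sketch: the framework, the stand-in model, and the epoch count are assumptions, since the paper only states the hyperparameter values quoted above.

```python
# Minimal sketch of the two reported optimization stages; only the learning
# rates, decay factors, and decay intervals come from the paper.
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

# Stand-in for the captioning model (two 1024-unit LSTMs, 1024-d word embedding).
model = nn.LSTM(input_size=1024, hidden_size=1024, num_layers=2, batch_first=True)

# Stage 1: cross-entropy training -- Adam, lr 5e-4, decayed by 0.8 every 3 epochs.
xe_optimizer = optim.Adam(model.parameters(), lr=5e-4)
xe_scheduler = StepLR(xe_optimizer, step_size=3, gamma=0.8)

# Stage 2: CIDEr optimization -- lr re-initialized to 5e-5,
# decayed by 0.1 every 50 epochs.
rl_optimizer = optim.Adam(model.parameters(), lr=5e-5)
rl_scheduler = StepLR(rl_optimizer, step_size=50, gamma=0.1)

for epoch in range(30):  # epoch count is illustrative only; not reported in the paper
    # ... one pass over the 113,287-image Karpathy training split (batch size 64) ...
    xe_scheduler.step()
```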
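The same row states that the text-retrieval module is trained with batch size 128 and margin α = 0.2, but not the exact objective. The sketch below assumes a common VSE-style bidirectional hinge (triplet ranking) loss, which is what such a margin typically parameterizes in image-text retrieval; treat the function name, embedding dimension, and loss form as assumptions rather than the authors' implementation.

```python
# Illustrative hinge-based ranking loss with the reported margin of 0.2.
import torch
import torch.nn.functional as F

def ranking_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Sum of hinge violations over all in-batch negatives, in both directions."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()              # cosine similarities, shape (B, B)
    diagonal = scores.diag().view(-1, 1)        # scores of the matching pairs

    cost_txt = (margin + scores - diagonal).clamp(min=0)      # image -> caption direction
    cost_img = (margin + scores - diagonal.t()).clamp(min=0)  # caption -> image direction

    mask = torch.eye(scores.size(0), dtype=torch.bool)        # ignore the positive pairs
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

# Usage with the reported batch size of 128 and a hypothetical 1024-d joint space.
loss = ranking_loss(torch.randn(128, 1024), torch.randn(128, 1024), margin=0.2)
```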