MemCap: Memorizing Style Knowledge for Image Captioning

Authors: Wentian Zhao, Xinxiao Wu, Xiaoxun Zhang (pp. 12984-12992)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two stylized image captioning datasets (SentiCap and FlickrStyle10K) demonstrate the effectiveness of our method. Extensive experiments on several datasets demonstrate the superior performance of our method compared with the state-of-the-art methods.
Researcher Affiliation | Collaboration | Wentian Zhao (1), Xinxiao Wu (1), Xiaoxun Zhang (2); (1) Lab. of IIT, School of Computer Science, Beijing Institute of Technology, Beijing 100081, China; (2) Alibaba Group
Pseudocode | Yes | Algorithm 1: Training Procedure of MemCap
Open Source Code | No | The paper does not provide a specific link or an explicit statement about the public availability of the source code for the methodology described.
Open Datasets | Yes | The factual descriptions and corresponding images are from MSCOCO (Lin et al. 2014) dataset. The stylized descriptions are from SentiCap dataset (Mathews, Xie, and He 2016) that includes positive and negative styles, and FlickrStyle10K dataset (Gan et al. 2017) that includes humorous and romantic styles.
Dataset Splits | Yes | The SentiCap dataset contains 2360 images from MSCOCO dataset, as well as 5013 positive sentences and 4500 negative sentences. For the positive sentences, we use 2994 sentences for training and 2019 sentences for testing, and for the negative sentences, we use 2991 sentences for training and 1509 sentences for testing. The original FlickrStyle10K dataset is composed of 10,000 images and each image has one romantic description and one humorous description. However, only the official training split that contains 7,000 images is publicly available. Following (Guo et al. 2019), we randomly sample 6,000 images as our training split and the rest of the images are used for testing. ... The training set, validation set and test set contain 1000, 215 and 215 videos, respectively. (A sketch of the FlickrStyle10K split appears after the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running experiments were mentioned.
Software Dependencies | No | The paper mentions the 'SRILM toolkit' and the 'jieba toolkit' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | In the memory module, the size of memory matrices Ms and M s are both set to 300 × 100. In the captioner, the dimension of word embedding vector Ew is set to 300 and the dimensions of cell state of two LSTM layers are set to 512. The values of parameters λ1, λ2, λ3 in Equation 9 are set to 1.0, 1.0 and 0.5, respectively. In both pre-training stage and fine-tuning stage, the Adam optimizer (Kingma and Ba 2014) is applied. During pre-training, the learning rate is fixed at 5 × 10⁻⁴. During fine-tuning, the initial learning rate is set to 5 × 10⁻⁴ and decays at a rate of 0.8 for every 10 epochs.
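Read as a configuration, the Experiment Setup row translates into the sketch below. This is not the authors' code: the module wiring, vocabulary size, and epoch count are assumptions; only the dimensions, loss weights λ1-λ3, optimizer choice, learning rates, and decay schedule come from the row above.

```python
# Hyperparameters reported in the "Experiment Setup" row, expressed as a
# PyTorch-style configuration sketch. Everything not listed in that row
# (vocabulary size, module wiring, number of epochs) is an assumption.
import torch
import torch.nn as nn

EMBED_DIM = 300                # word embedding dimension Ew
LSTM_HIDDEN = 512              # cell-state size of both LSTM layers
MEMORY_SHAPE = (300, 100)      # size of each memory matrix
LAMBDA1, LAMBDA2, LAMBDA3 = 1.0, 1.0, 0.5   # weights in Equation 9

# Placeholder modules standing in for the captioner; the real architecture differs.
captioner = nn.ModuleDict({
    "embedding": nn.Embedding(10000, EMBED_DIM),           # vocab size assumed
    "lstm": nn.LSTM(EMBED_DIM, LSTM_HIDDEN, num_layers=2),
})
style_memory = nn.Parameter(torch.randn(*MEMORY_SHAPE))

params = list(captioner.parameters()) + [style_memory]

# Pre-training: Adam with the learning rate fixed at 5e-4.
pretrain_optimizer = torch.optim.Adam(params, lr=5e-4)

# Fine-tuning: Adam starting at 5e-4, decayed by a factor of 0.8 every 10 epochs.
finetune_optimizer = torch.optim.Adam(params, lr=5e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(finetune_optimizer, step_size=10, gamma=0.8)

for epoch in range(30):        # epoch count is an assumption; not reported in the paper
    # ... one fine-tuning epoch would run here ...
    lr_scheduler.step()
```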
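The FlickrStyle10K split described in the Dataset Splits row (6,000 of the 7,000 released training images sampled for training, the remaining 1,000 held out for testing, following Guo et al. 2019) can be reproduced along the lines of the following sketch. The image-id list and the random seed are placeholders, not the authors' actual split.

```python
# Minimal sketch of the FlickrStyle10K split from the "Dataset Splits" row:
# 6,000 of the 7,000 publicly released training images are randomly sampled
# for training; the remaining 1,000 are used for testing.
# The id list and the seed are illustrative assumptions.
import random

image_ids = [f"flickrstyle_{i:05d}" for i in range(7000)]  # placeholder ids

rng = random.Random(42)   # seed is an assumption; the paper only says "randomly sample"
rng.shuffle(image_ids)

train_ids = image_ids[:6000]
test_ids = image_ids[6000:]

assert len(train_ids) == 6000 and len(test_ids) == 1000
```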