Improving Image Captioning with Conditional Generative Adversarial Nets

Authors: Chen Chen, Shuai Mu, Wanpeng Xiao, Zexiong Ye, Liesi Wu, Qi Ju

AAAI 2019, pp. 8142-8150

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models."
Researcher Affiliation | Industry | Chen Chen, Shuai Mu, Wanpeng Xiao, Zexiong Ye, Liesi Wu, Qi Ju (Tencent AI Lab, Shenzhen, China 518000; {beckhamchen, harrymou, wanpengxiao, joeyye, henrylwu, damonju}@tencent.com)
Pseudocode | Yes | "Algorithm 1 describes the image captioning algorithm via the generative adversarial training method in detail."
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is released, nor does it provide a link to a code repository.
Open Datasets | Yes | "The most widely used image captioning training and evaluation dataset is the MSCOCO dataset (Lin et al. 2014), which contains 82,783, 40,504, and 40,775 images with 5 captions each for training, validation, and test, respectively. For offline evaluation, following the Karpathy splits from (Karpathy and Fei-Fei 2015), we use a set of 5K images for validation, 5K images for test and 113,287 images for training."
Dataset Splits | Yes | "The most widely used image captioning training and evaluation dataset is the MSCOCO dataset (Lin et al. 2014), which contains 82,783, 40,504, and 40,775 images with 5 captions each for training, validation, and test, respectively. For offline evaluation, following the Karpathy splits from (Karpathy and Fei-Fei 2015), we use a set of 5K images for validation, 5K images for test and 113,287 images for training." (See the split-counting sketch after this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. It mentions a "training stage" and "experimental experience" but gives no hardware specifications.
Software Dependencies | No | The paper mentions the "ADAM (Kingma and Ba 2014) optimizer" but does not provide specific version numbers for any software, libraries, or dependencies used in the experiments.
Experiment Setup | Yes | "The LSTM hidden dimension for the RNN-based discriminator is 512. The dimension of the image CNN feature and word embedding for both CNN-based and RNN-based discriminators is fixed to 2048. We initialize the discriminator via pre-training the model for 10 epochs by minimizing the cross-entropy loss in Eq. (12) using the ADAM (Kingma and Ba 2014) optimizer with a batch size of 16, an initial learning rate of 1 × 10⁻³ and momentum of 0.9 and 0.999. Similarly, the generator is also pre-trained by MLE for 25 epochs. We use a beam search with a beam size of 5 when validating and testing. The final optimal hyper-parameters of our proposed algorithm are λ = 0.3, g = 1, d = 1 and Q = CIDEr-D." (See the optimizer sketch below.)
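
To make the split numbers quoted under "Open Datasets" and "Dataset Splits" concrete, here is a minimal sketch that tallies the Karpathy split of MSCOCO. It assumes the publicly distributed dataset_coco.json file from Karpathy and Fei-Fei (2015) and its "images"/"split" fields; neither the file name nor that schema is stated in the report.

```python
# Minimal sketch (assumed file name and schema): count MSCOCO images per Karpathy split.
import json
from collections import Counter

with open("dataset_coco.json") as f:  # hypothetical local path to the Karpathy split file
    data = json.load(f)

counts = Counter(img["split"] for img in data["images"])
print(counts)
# Expected per the quote: 5,000 "val" images, 5,000 "test" images, and 113,287
# training images (the 82,783 original training images plus the "restval"
# portion of the original validation set, which is conventionally folded into training).
```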
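
The optimization settings quoted under "Experiment Setup" can be summarized as a short configuration. The sketch below uses PyTorch purely for illustration and is not the authors' implementation; the nn.LSTM placeholder stands in for the paper's RNN-based discriminator, and only the numeric hyper-parameter values come from the quote.

```python
# A minimal PyTorch-style sketch of the quoted settings; the LSTM placeholder
# is an assumption, not the authors' discriminator architecture.
import torch
import torch.nn as nn

hidden_dim = 512    # LSTM hidden dimension of the RNN-based discriminator
feat_dim = 2048     # image CNN feature / word embedding dimension
batch_size = 16
lambda_mix = 0.3    # reward-mixing weight lambda
beam_size = 5       # beam width used for validation and testing

# Placeholder discriminator with the quoted dimensions.
discriminator = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)

# ADAM with initial learning rate 1e-3 and momentum terms (betas) of 0.9 and 0.999,
# used to pre-train the discriminator for 10 epochs on a cross-entropy loss;
# the generator is separately pre-trained by MLE for 25 epochs.
optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-3, betas=(0.9, 0.999))
```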