Improving Image Captioning with Conditional Generative Adversarial Nets
Authors: Chen Chen, Shuai Mu, Wanpeng Xiao, Zexiong Ye, Liesi Wu, Qi Ju
AAAI 2019, pp. 8142-8150 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models. |
| Researcher Affiliation | Industry | Chen Chen, Shuai Mu, Wanpeng Xiao, Zexiong Ye, Liesi Wu, Qi Ju Tencent AI Lab, Shenzhen, China 518000 {beckhamchen, harrymou, wanpengxiao, joeyye, henrylwu, damonju}@tencent.com |
| Pseudocode | Yes | Algorithm 1 describes the image captioning algorithm via the generative adversarial training method in detail. (A hedged sketch of such a training loop appears below the table.) |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is released or provide a link to a code repository. |
| Open Datasets | Yes | The most widely used image captioning training and evaluation dataset is the MSCOCO dataset (Lin et al. 2014) which contains 82,783, 40,504, and 40,775 images with 5 captions each for training, validation, and test, respectively. For offline evaluation, following the Karpathy splits from (Karpathy and Fei-Fei 2015), we use a set of 5K images for validation, 5K images for test and 113,287 images for training. |
| Dataset Splits | Yes | The most widely used image captioning training and evaluation dataset is the MSCOCO dataset (Lin et al. 2014) which contains 82,783, 40,504, and 40,775 images with 5 captions each for training, validation, and test, respectively. For offline evaluation, following the Karpathy splits from (Karpathy and Fei-Fei 2015), we use a set of 5K images for validation, 5K images for test and 113,287 images for training. (A splits-loading sketch appears below the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. It mentions 'training stage' and 'experimental experience' but no hardware specifications. |
| Software Dependencies | No | The paper mentions 'ADAM (Kingma and Ba 2014) optimizer' but does not provide specific version numbers for any software, libraries, or dependencies used in the experiments. |
| Experiment Setup | Yes | The LSTM hidden dimension for the RNN-based discriminator is 512. The dimension of image CNN feature and word embedding for both CNN-based and RNN-based discriminators is fixed to 2048. We initialize the discriminator via pre-training the model for 10 epochs by minimizing the cross entropy loss in Eq. (12) using the ADAM (Kingma and Ba 2014) optimizer with a batch size of 16, an initial learning rate of 1×10⁻³ and momentum of 0.9 and 0.999. Similarly, the generator is also pre-trained by MLE for 25 epochs. We use a beam search with a beam size of 5 when validating and testing. The final optimal hyper-parameters of our proposed algorithm are λ = 0.3, g = 1, d = 1 and Q = CIDEr-D. (See the configuration sketch below the table.) |
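The Pseudocode row above cites Algorithm 1, the paper's adversarial training procedure. The sketch below shows what such an alternating schedule could look like in PyTorch, under assumptions: the `generator.sample`, `discriminator(images, captions)`, and `cider_d` interfaces are hypothetical placeholders, and the mixed reward `λ·D + (1−λ)·Q` follows the paper's description with its reported optimum λ = 0.3, g = 1, d = 1. This is not the authors' released code.

```python
import torch
import torch.nn.functional as F

LAMBDA = 0.3   # reward mixing weight reported in the paper
G_STEPS = 1    # generator updates per iteration (paper's optimal g)
D_STEPS = 1    # discriminator updates per iteration (paper's optimal d)

def train_adversarial(generator, discriminator, loader, cider_d, epochs):
    """Hypothetical alternating GAN schedule; all model interfaces are assumed."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3, betas=(0.9, 0.999))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3, betas=(0.9, 0.999))
    for _ in range(epochs):
        for images, gt_captions in loader:
            # g-steps: policy-gradient update with a mixed reward
            for _ in range(G_STEPS):
                captions, log_probs = generator.sample(images)
                # reward = lambda * D(image, caption) + (1 - lambda) * CIDEr-D
                reward = (LAMBDA * discriminator(images, captions)
                          + (1 - LAMBDA) * cider_d(captions, gt_captions))
                g_loss = -(log_probs * reward.detach()).mean()
                g_opt.zero_grad(); g_loss.backward(); g_opt.step()
            # d-steps: push D toward 1 on human captions, 0 on sampled ones
            # (assumes the discriminator outputs probabilities in (0, 1))
            for _ in range(D_STEPS):
                fake = generator.sample(images)[0]
                d_loss = (F.binary_cross_entropy(discriminator(images, gt_captions),
                                                 torch.ones(len(images)))
                          + F.binary_cross_entropy(discriminator(images, fake),
                                                   torch.zeros(len(images))))
                d_opt.zero_grad(); d_loss.backward(); d_opt.step()
```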
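The dataset rows quote the Karpathy splits (113,287 train / 5,000 val / 5,000 test images). Below is a minimal sketch of how these splits are commonly materialized; it assumes the widely circulated `dataset_coco.json` file from Karpathy's NeuralTalk release, which the paper itself does not distribute.

```python
import json
from collections import defaultdict

def load_karpathy_splits(path="dataset_coco.json"):
    """Partition MSCOCO image records by their Karpathy 'split' field."""
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data["images"]:
        # Each record carries a 'split' field: train / restval / val / test.
        # Following common practice, 'restval' is folded into training,
        # giving 113,287 train, 5,000 val, and 5,000 test images.
        split = "train" if img["split"] == "restval" else img["split"]
        splits[split].append(img)
    return splits

if __name__ == "__main__":
    splits = load_karpathy_splits()
    print({k: len(v) for k, v in splits.items()})
```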
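Finally, the Experiment Setup row can be condensed into a configuration sketch. The discriminator class below is a hypothetical stand-in (including the vocabulary size); only the hyper-parameter values come from the paper's quoted text.

```python
import torch
import torch.nn as nn

class RNNDiscriminator(nn.Module):
    """Placeholder RNN-based discriminator with the reported sizes:
    LSTM hidden dimension 512; word-embedding dimension 2048.
    The vocabulary size is an assumption, not from the paper."""
    def __init__(self, vocab_size=10000, embed_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))
        return torch.sigmoid(self.score(h[-1])).squeeze(-1)

disc = RNNDiscriminator()
optimizer = torch.optim.Adam(
    disc.parameters(),
    lr=1e-3,             # initial learning rate 1×10⁻³
    betas=(0.9, 0.999),  # the "momentum of 0.9 and 0.999" in the paper
)
BATCH_SIZE = 16          # pre-training batch size
D_PRETRAIN_EPOCHS = 10   # discriminator pre-trained on the Eq. (12) cross entropy
G_PRETRAIN_EPOCHS = 25   # generator pre-trained by MLE
BEAM_SIZE = 5            # beam search width at validation/test time
```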