Interactive Dual Generative Adversarial Networks for Image Captioning
Authors: Junhao Liu, Kai Wang, Chunpu Xu, Zhou Zhao, Ruifeng Xu, Ying Shen, Min Yang
AAAI 2020, pp. 11588-11595 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the MSCOCO dataset demonstrate that the proposed IDGAN model significantly outperforms the compared methods for image captioning. Section 6 (Experimental Results) covers quantitative evaluation (6.1), human evaluation (6.2), a case study (6.3), and error analysis (6.4). |
| Researcher Affiliation | Academia | Junhao Liu (1,2), Kai Wang (1), Chunpu Xu (3), Zhou Zhao (4), Ruifeng Xu (5), Ying Shen (6), Min Yang (1). Affiliations: (1) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) Huazhong University of Science and Technology; (4) Zhejiang University; (5) Harbin Institute of Technology (Shenzhen); (6) Peking University Shenzhen Graduate School. Emails: {jh.liu, kai.wang, min.yang}@siat.ac.cn, cpx@hust.edu.cn, zhaozhou@zju.edu.cn, xuruifeng@hit.edu.cn, shenying@pkusz.edu.cn |
| Pseudocode | No | The paper describes algorithms and processes but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, such as a repository link or an explicit code release statement. |
| Open Datasets | Yes | Dataset We adopt the widely used MSCOCO 2014 (denoted as MSCOCO) image captions dataset (Karpathy and Fei-Fei 2015) as the experimental data. In total, MSCOCO is composed of 82,783 training images, 40,504 validation images, and 40,775 testing images. For the off-line testing, we use the Karpathy split setting (Karpathy and Fei-Fei 2015), which has been widely adopted in previous studies. There are 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. |
| Dataset Splits | Yes | In total, MSCOCO is composed of 82,783 training images, 40,504 validation images, and 40,775 testing images. For the off-line testing, we use the Karpathy split setting (Karpathy and Fei-Fei 2015), which has been widely adopted in previous studies. There are 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions using Faster R-CNN. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). |
| Experiment Setup | Yes | The number of hidden units in the LSTM caption encoder is set to 512. The parameters of the LSTM networks are initialized with the normal distribution N(0, 0.01), and the other parameters are initialized with the uniform distribution [-0.01, 0.01]. The number of hidden units in the Top-Down attention LSTM (LSTM(1)) and the language model LSTM (LSTM(2)) is set to 1,024. The numbers of hidden units of the LSTMs used in Eq. (14) and Eq. (17) are set to 512. During adversarial training, the two generators (Gθ1 and Gθ2) each produce M1 = M2 = 5 candidate captions. The values of γ1, γ2, and γ3 are set to 0.2, 1, and 0.8, respectively. Gθ1 is pre-trained for 30 epochs with maximum likelihood and Dφ2 for 5 epochs with triplet loss; the whole model is then optimized with interactive adversarial training for 30 epochs. |
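
The Dataset and Dataset Splits rows report the official MSCOCO 2014 counts (82,783 train / 40,504 val / 40,775 test) and the Karpathy split (113,287 / 5,000 / 5,000). As a minimal sketch of how that split is typically materialized, the snippet below counts images per split from the `dataset_coco.json` file distributed with Karpathy and Fei-Fei (2015); the file path and the convention of folding the `restval` portion of the official val set into training are assumptions on our part, not details stated in the paper.

```python
import json
from collections import Counter

# Hypothetical path to Karpathy's split file; the paper does not specify one.
SPLIT_FILE = "data/dataset_coco.json"

with open(SPLIT_FILE) as f:
    images = json.load(f)["images"]

counts = Counter(img["split"] for img in images)

# Common convention: the 'restval' images (official val images not used for
# val/test in the Karpathy split) are folded into training, which is how
# 82,783 official training images become 113,287 training images.
n_train = counts["train"] + counts["restval"]
n_val = counts["val"]
n_test = counts["test"]

print(f"train: {n_train}, val: {n_val}, test: {n_test}")
# Expected with the Karpathy split: train: 113287, val: 5000, test: 5000
```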
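
To make the Experiment Setup row easier to scan, here is a hedged reconstruction of the reported hyperparameters as a PyTorch-style configuration, together with the two initialization schemes the paper states (normal N(0, 0.01) for LSTM parameters, uniform [-0.01, 0.01] for everything else). The paper releases no code, so `CONFIG`, `init_weights`, and all names in them are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

# Hyperparameters as reported in the Experiment Setup row (key names are illustrative).
CONFIG = {
    "caption_encoder_hidden": 512,       # LSTM caption encoder
    "attention_lstm_hidden": 1024,       # Top-Down attention LSTM (LSTM(1))
    "language_lstm_hidden": 1024,        # language model LSTM (LSTM(2))
    "discriminator_lstm_hidden": 512,    # LSTMs in Eq. (14) and Eq. (17)
    "num_candidates": 5,                 # M1 = M2 = 5 candidate captions per generator
    "gamma": (0.2, 1.0, 0.8),            # gamma_1, gamma_2, gamma_3
    "pretrain_generator_epochs": 30,     # MLE pre-training of G_theta1
    "pretrain_discriminator_epochs": 5,  # triplet-loss pre-training of D_phi2
    "adversarial_epochs": 30,            # interactive adversarial training
}

def init_weights(model: nn.Module) -> None:
    """Initialize LSTM parameters with N(0, 0.01) and all other parameters
    with U(-0.01, 0.01), following the paper's description."""
    for module in model.modules():
        if isinstance(module, (nn.LSTM, nn.LSTMCell)):
            for param in module.parameters():
                # Assumes N(0, 0.01) gives the standard deviation; the paper
                # does not say whether 0.01 is the std or the variance.
                nn.init.normal_(param, mean=0.0, std=0.01)
        else:
            # Only direct parameters, so LSTM children keep their normal init.
            for param in module.parameters(recurse=False):
                nn.init.uniform_(param, a=-0.01, b=0.01)
```

In use, one would call `init_weights` on the assembled captioning model before the 30-epoch MLE pre-training stage; the epoch counts and γ weights in `CONFIG` are taken directly from the Experiment Setup row above.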