Learning Distinct and Representative Modes for Image Captioning
Authors: Qi Chen, Chaorui Deng, Qi Wu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embedding successfully facilitates these models to generate high-quality image captions with different modes, further leading to better performance for both diversity and quality on the MSCOCO dataset. ... We also find that our models perform surprisingly well under diversity evaluation (using metrics like Self-CIDEr [52]) and oracle performance evaluation (on mainstream reference-based metrics like CIDEr [47]), achieving new state-of-the-art results. ... In the experiments, we evaluate the effectiveness of our proposed DML paradigm by applying it to the widely-used Transformer [46] and the state-of-the-art AoANet [20], denoted by Transformer-DML and AoANet-DML, respectively. ... We compare our model with previous SoTA methods as well as Beam Search (BS) and show the results in Table 1. |
| Researcher Affiliation | Academia | Qi Chen, Chaorui Deng, Qi Wu; Australian Institute for Machine Learning, University of Adelaide; {qi.chen04, chaorui.deng, qi.wu01}@adelaide.edu.au |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/bladewaltz1/ModeCap |
| Open Datasets | Yes | We train and evaluate our method on the MSCOCO dataset [28], which contains 123,287 images, and each image corresponds to at least 5 captions. |
| Dataset Splits | Yes | For a fair comparison, we follow the previous works [32, 33] in the area of diverse and controllable image captioning to use the m-RNN split [34] of the COCO dataset, which divides the data into 118,287, 4,000 and 1,000 for training, validation and testing, respectively. |
| Hardware Specification | Yes | We train Transformer-DML and AoANet-DML on one NVIDIA 3090 GPU with about 10 and 13 GPU hours, respectively. |
| Software Dependencies | No | The paper mentions software components such as the 'AdamW [29]' optimizer, 'Faster R-CNN [40]', and 'Transformer [46]', but does not specify their versions or the versions of general software dependencies such as Python or the deep learning framework. |
| Experiment Setup | Yes | In the CdVAE branch, the number of transformer layers for E_m and D_m is set to 6 and 2, respectively. We use 12 attention heads and a hidden size of 768 for all transformer layers. β in Eq. (2) is set to 0.25, following [45]. The number of mode embeddings in Ω is set to 64 by default. We train the models for 100,000 iterations with a batch of 64 images and all the paired captions. We use the AdamW [29] optimizer with a learning rate of 2e-4, cosine-decayed to 0. We use label smoothing of 0.1 and a gradient clipping threshold of 1.0. |
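
The experiment-setup excerpt quoted above maps onto a straightforward optimization configuration. Below is a minimal, hedged sketch of those reported hyperparameters (AdamW at learning rate 2e-4 with cosine decay to 0, label smoothing 0.1, gradient clipping at 1.0, 100,000 iterations, batch size 64), assuming PyTorch. The tiny linear model and random batches are illustrative placeholders only, not the paper's Transformer-DML or AoANet-DML implementation; the authors' code is at https://github.com/bladewaltz1/ModeCap.

```python
# Minimal sketch of the reported optimization setup, assuming PyTorch.
# The linear "model" and random batches are placeholders; they are NOT
# the paper's Transformer-DML / AoANet-DML code.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hyperparameters taken from the experiment-setup excerpt above.
TOTAL_ITERS = 100_000      # training iterations
BATCH_SIZE = 64            # images per batch (with all paired captions)
LEARNING_RATE = 2e-4       # AdamW learning rate, cosine-decayed to 0
LABEL_SMOOTHING = 0.1
GRAD_CLIP_NORM = 1.0
HIDDEN_SIZE = 768          # hidden size of the transformer layers
VOCAB_SIZE = 10_000        # placeholder vocabulary size (not reported here)

model = torch.nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)   # stand-in for the caption decoder head
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_ITERS, eta_min=0.0)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING)

for step in range(TOTAL_ITERS):
    # Placeholder batch: one hidden-size feature vector and one target token per example.
    features = torch.randn(BATCH_SIZE, HIDDEN_SIZE)
    targets = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE,))

    logits = model(features)
    loss = criterion(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_NORM)
    optimizer.step()
    scheduler.step()   # cosine decay of the learning rate toward 0
```

The model-specific details quoted above (6 encoder and 2 decoder transformer layers for E_m and D_m, 12 attention heads, β = 0.25, 64 mode embeddings) belong to the CdVAE architecture itself and are intentionally not reconstructed in this sketch.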