Learning Distinct and Representative Modes for Image Captioning

Authors: Qi Chen, Chaorui Deng, Qi Wu

NeurIPS 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embedding successfully facilitates these models to generate high-quality image captions with different modes, further leading to better performance for both diversity and quality on the MSCOCO dataset. ... We also find that our models perform surprisingly well under diversity evaluation (using metrics like Self-CIDEr [52]) and oracle performance evaluation (on mainstream reference-based metrics like CIDEr [47]), achieving new state-of-the-art results. ... In the experiments, we evaluate the effectiveness of our proposed DML paradigm by applying it to the widely-used Transformer [46] and the state-of-the-art AoANet [20], denoted by Transformer-DML and AoANet-DML, respectively. ... We compare our model with previous SoTA methods as well as Beam Search (BS) and show the results in Table 1. |
| Researcher Affiliation | Academia | Qi Chen, Chaorui Deng, Qi Wu; Australian Institute for Machine Learning, University of Adelaide; {qi.chen04, chaorui.deng, qi.wu01}@adelaide.edu.au |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/bladewaltz1/ModeCap |
| Open Datasets | Yes | We train and evaluate our method on the MSCOCO dataset [28], which contains 123,287 images, and each image corresponds to at least 5 captions. |
| Dataset Splits | Yes | For a fair comparison, we follow the previous works [32, 33] in the area of diverse and controllable image captioning to use the m-RNN split [34] of the COCO dataset, which divides the data into 118,287, 4,000 and 1,000 images for training, validation and testing, respectively. |
| Hardware Specification | Yes | We train Transformer-DML and AoANet-DML on one NVIDIA 3090 GPU with about 10 and 13 GPU hours, respectively. |
| Software Dependencies | No | The paper mentions software components such as the AdamW [29] optimizer, Faster R-CNN [40], and the Transformer [46], but does not specify version numbers for these or for general dependencies such as Python or the deep learning framework. |
| Experiment Setup | Yes | In the CdVAE branch, the number of transformer layers for E_m and D_m is set to 6 and 2, respectively. We use 12 attention heads and a hidden size of 768 for all transformer layers. β in Eq. (2) is set to 0.25, following [45]. The number of mode embeddings in Ω is set to 64 by default. We train the models for 100,000 iterations with a batch of 64 images and all the paired captions. We use the AdamW [29] optimizer with a learning rate of 2e-4, and cosine-decay it to 0. We use label smoothing of 0.1 and a gradient clipping threshold of 1.0. |
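As a consistency check on the two dataset rows above, the m-RNN split sizes sum to the reported dataset size: 118,287 + 4,000 + 1,000 = 123,287 images.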
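The experiment setup quoted in the last row maps onto a standard optimization configuration. Below is a minimal sketch, assuming PyTorch (1.10 or later for the `label_smoothing` argument); the toy decoder, vocabulary size, sequence length, and random token batches are placeholders for illustration and are not the authors' released implementation (see the ModeCap repository above). The CdVAE-specific parts (the 64-entry mode codebook and the β = 0.25 term in Eq. (2)) are omitted here.

```python
# Sketch of the reported optimization setup: AdamW at lr 2e-4 with cosine decay
# to 0 over 100,000 iterations, batch size 64, label smoothing of 0.1, and
# gradient clipping at 1.0. The tiny decoder and random data are placeholders.
import torch
import torch.nn as nn

VOCAB, HIDDEN, HEADS, STEPS = 1000, 768, 12, 100_000  # 768 hidden, 12 heads as reported

class ToyCaptionDecoder(nn.Module):  # placeholder, not Transformer-DML / AoANet-DML
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

model = ToyCaptionDecoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=STEPS, eta_min=0.0)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing of 0.1

for step in range(STEPS):
    tokens = torch.randint(0, VOCAB, (64, 20))        # dummy batch of 64 caption sequences
    logits = model(tokens[:, :-1])                    # (64, 19, VOCAB)
    loss = criterion(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
    optimizer.step()
    scheduler.step()                                  # cosine decay toward 0
```

Stepping the scheduler once per iteration with `T_max=100_000` is one way to realize "cosine decay it to 0" over 100,000 iterations; the paper does not state whether the schedule is updated per iteration or per epoch, so this detail is an assumption.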