Image Captioning with Context-Aware Auxiliary Guidance

Authors: Zeliang Song, Xiaofei Zhou, Zhendong Mao, Jianlong Tan

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments and analyses on the challenging Microsoft COCO image captioning benchmark to evaluate our proposed method. To validate the adaptability of our method, we apply CAAG to three popular image captioners (Att2all (2017), Up-Down (2018) and AoANet (2019)), and our model achieves consistent improvements over all metrics.
Researcher Affiliation | Academia | Zeliang Song (1,2), Xiaofei Zhou (1,2), Zhendong Mao (3), Jianlong Tan (1,2); (1) Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; (2) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; (3) University of Science and Technology of China, Hefei, China; {songzeliang, zhouxiaofei}@iie.ac.cn, zdmao@ustc.edu.cn
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a concrete link to the source code for the methodology described.
Open Datasets | Yes | Visual Genome dataset (Krishna et al. 2017) contains 108,077 images... Microsoft COCO (MSCOCO) 2014 captions dataset (Lin et al. 2014). MSCOCO contains a total of 164,062 images...
Dataset Splits | Yes | Visual Genome dataset... is split with 98K / 5K / 5K images for training/validation/testing... For hyperparameter selection and offline evaluation, we use the publicly available Karpathy split, which contains 113,287 training images and 5K images each for validation and testing. For online evaluation on the MSCOCO test server, we add the 5K test set into the training set to form a larger training set (118,287 images). (See the split-loading sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as the GPU or CPU models used for running its experiments.
Software Dependencies | No | The paper mentions software components such as Faster R-CNN and the ADAM optimizer, but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | Implementation Details: We employ the Up-Down captioner as our primary network, so we use the same hyperparameters proposed in (Anderson et al. 2018) for fair comparison. The Faster R-CNN implementation uses an IoU threshold of 0.7 for region proposal suppression, 0.3 for object class suppression, and a class detection confidence threshold of 0.2 for selecting salient image regions. For the captioning model, we set the dimension of the hidden states in both LSTMs to 1000, the number of hidden units in all attention layers to 512, and the dimension of the input word embedding to 1000. The trade-off coefficient in Eq. 8 is set to 0.5 and the batch size is 64. We use beam search with a beam size of 3 to generate captions when validating and testing. We employ the ADAM optimizer with an initial learning rate of 5e-4 and momentum of 0.9 and 0.999. We evaluate the model on the validation set at every epoch and select the model with the highest CIDEr score as the initialization for reinforcement learning. For self-critical learning, we select the CIDEr score as our reward function. The learning rate starts from 5e-5 and decays by a factor of 0.1 every 50 epochs. (See the configuration sketch below the table.)
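For reference on the Dataset Splits row: the Karpathy split is typically distributed as a JSON file in which every MSCOCO image carries a split label. The sketch below shows one common way to recover the 113,287 / 5K / 5K partition; the file name dataset_coco.json and the field names are assumptions based on the widely circulated release of that split, not something the paper specifies.

```python
# Minimal sketch of partitioning MSCOCO with the Karpathy split.
# Assumption: the standard dataset_coco.json file, where each entry in
# data["images"] has a "split" field in {"train", "val", "test", "restval"}.
import json
from collections import defaultdict

with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    # "restval" images are conventionally merged into the training set,
    # which yields the 113,287 training images quoted above.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img)

for name, imgs in splits.items():
    print(name, len(imgs))  # expected: train 113287, val 5000, test 5000
```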
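For reference on the Experiment Setup row: the quoted hyperparameters can be collected into a single training configuration. The sketch below is a minimal illustration under the assumption of a PyTorch implementation; the names CaptionerConfig and build_optimizer are hypothetical, and only the numeric values are taken from the quoted text.

```python
# Minimal sketch of the training configuration described in the
# "Experiment Setup" row. The paper releases no code, so the class and
# function names here are hypothetical; the values mirror the quoted text.
from dataclasses import dataclass

import torch
from torch import nn, optim


@dataclass
class CaptionerConfig:
    lstm_hidden_size: int = 1000        # hidden states in both LSTMs
    attention_hidden_size: int = 512    # hidden units in all attention layers
    word_embedding_size: int = 1000     # input word embedding dimension
    tradeoff_lambda: float = 0.5        # trade-off coefficient in Eq. 8
    batch_size: int = 64
    beam_size: int = 3                  # beam width for validation/testing
    xe_learning_rate: float = 5e-4      # cross-entropy pre-training phase
    rl_learning_rate: float = 5e-5      # self-critical (CIDEr reward) phase
    adam_betas: tuple = (0.9, 0.999)    # "momentum of 0.9 and 0.999"
    rl_lr_decay: float = 0.1            # decay factor ...
    rl_lr_decay_every: int = 50         # ... applied every 50 epochs


def build_optimizer(model: nn.Module, cfg: CaptionerConfig, self_critical: bool):
    """Return an ADAM optimizer (and a step-decay scheduler for the RL phase)."""
    lr = cfg.rl_learning_rate if self_critical else cfg.xe_learning_rate
    optimizer = optim.Adam(model.parameters(), lr=lr, betas=cfg.adam_betas)
    scheduler = None
    if self_critical:
        # Learning rate decays by a factor of 0.1 every 50 epochs.
        scheduler = optim.lr_scheduler.StepLR(
            optimizer, step_size=cfg.rl_lr_decay_every, gamma=cfg.rl_lr_decay)
    return optimizer, scheduler


if __name__ == "__main__":
    cfg = CaptionerConfig()
    dummy_model = nn.LSTM(cfg.word_embedding_size, cfg.lstm_hidden_size)
    opt, sched = build_optimizer(dummy_model, cfg, self_critical=True)
    print(opt, sched)
```

The two learning rates correspond to the two training phases described in the row: cross-entropy pre-training at 5e-4, followed by self-critical fine-tuning with the CIDEr reward at 5e-5 and a step decay of 0.1 every 50 epochs.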