Dual-level Collaborative Transformer for Image Captioning

Authors: Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, Rongrong Ji

AAAI 2021 | pp. 2286-2293

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our model, we conduct extensive experiments on the highly competitive MS-COCO dataset, and achieve new state-of-the-art performance on both local and online test sets, i.e., 133.8% CIDEr on Karpathy split and 135.4% CIDEr on the official split.
Researcher Affiliation | Collaboration | Yunpeng Luo (1), Jiayi Ji (1), Xiaoshuai Sun (1*), Liujuan Cao (1), Yongjian Wu (3), Feiyue Huang (3), Chia-Wen Lin (4), Rongrong Ji (1,2); 1 Media Analytics and Computing Lab, School of Informatics, Xiamen University; 2 Institute of Artificial Intelligence, Xiamen University; 3 Tencent Youtu Lab; 4 National Tsing Hua University
Pseudocode | No | The paper provides mathematical formulations and descriptions of its components but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about the release of its source code or a link to a code repository.
Open Datasets | Yes | We conduct our experiments on the benchmark image captioning dataset COCO (Lin et al. 2014).
Dataset Splits | Yes | For offline evaluation, we follow the widely adopted Karpathy split (Karpathy and Fei-Fei 2015), where 113,287, 5,000, and 5,000 images are used for training, validation, and testing, respectively. (A minimal split-loading sketch follows the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments; it only refers to model architectures such as ResNet-101.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer and Faster R-CNN but does not provide specific version numbers for any libraries, frameworks, or other software dependencies required to replicate the experiment.
Experiment Setup | Yes | In our implementation, we set d_model to 512 and the number of heads to 8. The number of layers for both encoder and decoder is set to 3. In the XE pre-training stage, we warm up our model for 4 epochs with the learning rate linearly increased to 1×10⁻⁴. Then we set the learning rate to 1×10⁻⁴ for epochs 5-10, 2×10⁻⁶ for epochs 11-12, and 4×10⁻⁷ afterwards. The batch size is set to 50. After the 18-epoch XE pre-training stage, we start to optimize our model with the CIDEr reward, using a 5×10⁻⁶ learning rate and a batch size of 100. We use the Adam optimizer in both stages, and the beam size is set to 5. (A training-schedule sketch follows the table.)
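The Karpathy split referenced in the Dataset Splits row is usually distributed as a single annotation file (commonly named dataset_coco.json). The sketch below shows one typical way to materialize the 113,287 / 5,000 / 5,000 counts from that file; the file name, field layout, and the convention of folding the "restval" images into training are assumptions based on common practice, not details stated in the paper.

```python
import json
from collections import defaultdict

def load_karpathy_split(path="dataset_coco.json"):
    """Group COCO images by the Karpathy split stored in the annotation file.

    Assumes the widely used layout: {"images": [{"filename": ..., "split": ...}, ...]}.
    The 'restval' portion is conventionally merged into training, which is how the
    113,287 / 5,000 / 5,000 train/val/test counts quoted above arise.
    """
    with open(path) as f:
        data = json.load(f)

    splits = defaultdict(list)
    for img in data["images"]:
        split = img["split"]
        if split == "restval":   # fold restval into train (common convention)
            split = "train"
        splits[split].append(img["filename"])
    return splits

if __name__ == "__main__":
    splits = load_karpathy_split()
    for name in ("train", "val", "test"):
        print(name, len(splits[name]))   # expected: 113287 / 5000 / 5000
```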
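The optimization schedule quoted in the Experiment Setup row maps naturally onto a piecewise learning-rate function. The following is a minimal PyTorch sketch of that two-stage schedule, assuming a placeholder model and a LambdaLR scheduler for the XE stage; the rate values and batch/beam settings come from the quoted setup, while the dummy module, optimizer defaults, and loop skeleton are illustrative assumptions rather than the authors' implementation.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 1e-4  # peak learning rate reported for the XE stage

def xe_lr_scale(epoch: int) -> float:
    """Piecewise schedule from the reported setup (epochs counted from 0).

    epochs 0-3 : linear warm-up to 1e-4
    epochs 4-9 : 1e-4
    epochs 10-11: 2e-6
    epoch 12+  : 4e-7
    Returned values are multipliers applied to BASE_LR.
    """
    if epoch < 4:
        return (epoch + 1) / 4.0
    if epoch < 10:
        return 1.0
    if epoch < 12:
        return 2e-6 / BASE_LR
    return 4e-7 / BASE_LR

# Placeholder for the captioning network (d_model=512, 8 heads, 3 encoder/decoder
# layers as reported); a dummy linear layer keeps the sketch runnable.
model = torch.nn.Linear(512, 512)

# Stage 1: cross-entropy (XE) pre-training, batch size 50, 18 epochs.
optimizer = Adam(model.parameters(), lr=BASE_LR)
scheduler = LambdaLR(optimizer, lr_lambda=xe_lr_scale)
for epoch in range(18):
    # ... run one XE epoch over the training split here ...
    scheduler.step()

# Stage 2: CIDEr (self-critical) optimization with a fixed rate,
# batch size 100, beam size 5 at decoding time.
for group in optimizer.param_groups:
    group["lr"] = 5e-6
```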