End-to-End Transformer Based Model for Image Captioning

Authors: Yiyu Wang, Jungang Xu, Yingfei Sun

AAAI 2022, pp. 2585-2594

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of our proposed model, we conduct experiments on MSCOCO dataset. The experimental results compared to existing published works demonstrate that our model achieves new state-of-the-art performances of 138.2% (single model) and 141.0% (ensemble of 4 models) CIDEr scores on Karpathy offline test split and 136.0% (c5) and 138.3% (c40) CIDEr scores on the official online test server.
Researcher Affiliation | Academia | Yiyu Wang¹, Jungang Xu²*, Yingfei Sun¹; ¹School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; ²School of Computer Science and Technology, University of Chinese Academy of Sciences
Pseudocode | No | The paper presents architectural diagrams and mathematical formulations but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | Trained models and source code will be released.
Open Datasets | Yes | We conduct experiments on the MSCOCO 2014 dataset (Lin et al. 2014), which contains 123287 images (82783 for training and 40504 for validation), and each is annotated with 5 reference captions.
Dataset Splits | Yes | We conduct experiments on the MSCOCO 2014 dataset (Lin et al. 2014), which contains 123287 images (82783 for training and 40504 for validation)... In this paper, we follow the Karpathy split (Karpathy and Fei-Fei 2017) to redivide the MSCOCO, where 113287 images for training, 5000 images for validation and 5000 images for offline evaluation. (see the split-check sketch after this table)
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory amounts) are mentioned in the paper.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify any software libraries, frameworks, or their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We set the model embedding size D to 512, the number of transformer heads to 8, the number of blocks N for both refining encoder and decoder to 3. For the training process, we first train our model under XE loss L_XE for 20 epochs, and set the batch size to 10 and warmup steps to 10,000; then we train our model under L_R for another 30 epochs with fixed learning rate of 5 × 10^-6. We adopt Adam (Kingma and Ba 2015) optimizer in both above stages and the beam size is set to 5 in validation and evaluation process. (see the setup sketch after this table)
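
For the dataset-split row, the reported sizes can be sanity-checked with a short Python sketch. This is not from the paper: the "dataset_coco.json" file name and its "images"/"split" fields refer to the commonly distributed Karpathy split file and are assumptions, not something the authors specify.

# Minimal sketch (not from the paper): checking the Karpathy split sizes,
# assuming the widely used "dataset_coco.json" split file; the file name and
# field names below are assumptions about that file, not the authors' code.
import json
from collections import Counter

with open("dataset_coco.json") as f:   # hypothetical local path
    images = json.load(f)["images"]

counts = Counter(img["split"] for img in images)
# The paper's 113287 training images correspond to "train" + "restval" in
# this split file; "val" and "test" should each contain 5000 images.
print(counts["train"] + counts["restval"], counts["val"], counts["test"])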
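
For the experiment-setup row, a minimal sketch of how the quoted hyperparameters could be wired up is shown below. The numbers (512-d embeddings, 8 heads, 3 blocks, 20 XE epochs, 30 RL epochs, batch size 10, 10,000 warmup steps, fixed 5 × 10^-6 RL learning rate, beam size 5) are quoted from the paper; the torch.nn.Transformer stand-in, the Noam-style warmup form, the Adam betas/eps, and the self-critical reading of L_R are assumptions, since the authors' code is not yet released.

# Illustrative sketch only: the constants are quoted from the paper, but the
# model stand-in, warmup schedule form, and Adam betas/eps are assumptions.
import torch

D_MODEL, N_HEADS, N_BLOCKS = 512, 8, 3        # embedding size, heads, encoder/decoder blocks
XE_EPOCHS, RL_EPOCHS = 20, 30                 # cross-entropy stage, then CIDEr-RL stage
BATCH_SIZE, WARMUP_STEPS, RL_LR, BEAM = 10, 10_000, 5e-6, 5

model = torch.nn.Transformer(d_model=D_MODEL, nhead=N_HEADS,
                             num_encoder_layers=N_BLOCKS,
                             num_decoder_layers=N_BLOCKS)   # stand-in for the captioning model

# Stage 1: XE training with Adam and warmup (Noam-style schedule assumed).
xe_optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                                betas=(0.9, 0.98), eps=1e-9)
xe_scheduler = torch.optim.lr_scheduler.LambdaLR(
    xe_optimizer,
    lr_lambda=lambda step: (D_MODEL ** -0.5) *
        min(max(step, 1) ** -0.5, max(step, 1) * WARMUP_STEPS ** -1.5))

# Stage 2: training under L_R (read here as CIDEr-based self-critical
# reinforcement learning) at the fixed learning rate quoted in the paper.
rl_optimizer = torch.optim.Adam(model.parameters(), lr=RL_LR)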