Attention-Aligned Transformer for Image Captioning

Authors: Zhengcong Fei

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A2 Transformer consistently outperforms baselines in both automatic metrics and human evaluation.
Researcher Affiliation | Academia | Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; feizhengcong@ict.ac.cn
Pseudocode | No | The paper describes methods using natural language and mathematical equations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Trained models and code for reproducing the experiments are publicly available.
Open Datasets | Yes | All experiments are conducted on the most popular image captioning dataset, MS COCO (Chen et al. 2015).
Dataset Splits | Yes | We follow the common practice of the Karpathy splits (Karpathy and Fei-Fei 2015) for validation of model hyperparameters and offline evaluation. This split contains 113,287 images for training and 5,000 images each for validation and test.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) needed to replicate the experiments.
Experiment Setup | Yes | We set the dimensionality d of each layer to 512 and the number of heads to 8. We employ a dropout rate of 0.1 after each attention and feed-forward layer. The model is first trained to minimize the negative log-likelihood of the training data, following a learning rate scheduling strategy with a warmup of 10,000 steps, and then fine-tuned with the CIDEr score using reinforcement learning (Rennie et al. 2017) at a fixed learning rate of 5 × 10⁻⁶. We train all models with the Adam optimizer (Kingma and Ba 2014), a batch size of 50, and a beam size of 5. We set the hyperparameter η = 0.1 in Equation 7 in all experiments.
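For the Karpathy split quoted in the Dataset Splits row, the sketch below shows one common way to derive the 113,287 / 5,000 / 5,000 partition from the widely used dataset_coco.json split file. The file name and field layout are assumptions about that standard artifact, not details confirmed by the paper.

```python
import json
from collections import defaultdict

# Hypothetical path to the Karpathy split file for MS COCO.
with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    # "restval" images are conventionally folded into training, which yields
    # the 113,287 / 5,000 / 5,000 partition quoted in the table above.
    split = "train" if img["split"] in ("train", "restval") else img["split"]
    splits[split].append(img["filename"])

print({name: len(files) for name, files in splits.items()})
# expected: {'train': 113287, 'val': 5000, 'test': 5000}
```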
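The hyperparameters in the Experiment Setup row can be collected into a single training configuration. The sketch below is a minimal illustration assuming a PyTorch pipeline with the Transformer warmup schedule of Vaswani et al.; the A2Transformer class, the Adam betas, and all variable names are hypothetical and do not reflect the authors' released code.

```python
import torch  # used by the commented-out optimizer/scheduler lines below

# Values taken from the Experiment Setup row; everything else is assumed.
config = {
    "d_model": 512,           # dimensionality d of each layer
    "num_heads": 8,           # attention heads
    "dropout": 0.1,           # after each attention and feed-forward layer
    "warmup_steps": 10_000,   # warmup for the cross-entropy (XE) stage
    "rl_lr": 5e-6,            # fixed learning rate for CIDEr-based RL fine-tuning
    "batch_size": 50,
    "beam_size": 5,
    "eta": 0.1,               # hyperparameter η in Equation 7
}

def xe_lr_lambda(step: int) -> float:
    """Transformer-style warmup schedule (assumed to follow Vaswani et al.)."""
    step = max(step, 1)
    return config["d_model"] ** -0.5 * min(
        step ** -0.5, step * config["warmup_steps"] ** -1.5
    )

# model = A2Transformer(d_model=config["d_model"], n_heads=config["num_heads"],
#                       dropout=config["dropout"])           # hypothetical class
# xe_optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))
# xe_scheduler = torch.optim.lr_scheduler.LambdaLR(xe_optimizer, xe_lr_lambda)
# rl_optimizer = torch.optim.Adam(model.parameters(), lr=config["rl_lr"])
```

In this two-stage recipe the warmup schedule governs only the cross-entropy stage, while the CIDEr-based reinforcement learning fine-tuning uses the fixed 5 × 10⁻⁶ learning rate.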