ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning

Authors: Jingyu Li, Zhendong Mao, Shancheng Fang, Hao Li

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our ER-SAN, with improvements of CIDEr from 128.6% to 135.3%, achieving state-of-the-art performance.
Researcher Affiliation | Academia | ¹University of Science and Technology of China, Hefei, China; ²Huazhong University of Science and Technology, Wuhan, China
Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Codes will be released at https://github.com/CrossmodalGroup/ER-SAN.
Open Datasets | Yes | To validate our proposed framework, we conduct extensive experiments on MS-COCO [Lin et al., 2014], which is the most commonly used dataset for image captioning.
Dataset Splits | Yes | According to the Karpathy splits [Karpathy and Fei-Fei, 2015], 5,000 images are used for validation, 5,000 for testing, and 113,287 for training (a minimal split-loading sketch follows the table).
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions using the "Adam optimizer" and refers to a "Transformer-based model", but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | Following the Transformer-based model [Vaswani et al., 2017] and [Guo et al., 2020], our encoder and decoder have the same number of layers L, the number of heads is 8, the hidden dimension is 512, the inner dimension of the feed-forward module is 2048, and dropout = 0.1. We use the Adam optimizer with a mini-batch size of 10 to train our model. For cross-entropy training, we increase the learning rate linearly to 3e-4 with warm-up for 3 epochs, and then decay it by a rate of 0.5 every 3 epochs. We first train the model for 18 epochs with the cross-entropy loss and then further optimize with the CIDEr reward for an additional 40 epochs with a fixed learning rate of 5e-6. The beam search size is 2 when generating captions during testing. Unless otherwise specified, we set the baseline transformer model layers L = 4. (A hedged configuration sketch follows the table.)
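
The Karpathy split quoted in the Dataset Splits row is conventionally read from a single split file. The sketch below is a minimal, assumed implementation: the file name dataset_coco.json and the merging of the "restval" portion into the training set are conventions of the publicly distributed Karpathy split, not details stated in the paper.

```python
# Minimal sketch of partitioning MS-COCO images by the Karpathy split.
# Assumes the standard "dataset_coco.json" split file; the paper does not
# specify file names or loading code.
import json
from collections import defaultdict

def load_karpathy_splits(path="dataset_coco.json"):
    """Group image file names by their Karpathy split label."""
    with open(path) as f:
        data = json.load(f)

    splits = defaultdict(list)
    for img in data["images"]:
        # "restval" images are conventionally merged into the training set,
        # which yields the 113,287 / 5,000 / 5,000 partition quoted above.
        split = "train" if img["split"] == "restval" else img["split"]
        splits[split].append(img["filename"])
    return splits

if __name__ == "__main__":
    splits = load_karpathy_splits()
    for name in ("train", "val", "test"):
        print(name, len(splits[name]))  # expected: 113287, 5000, 5000
```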
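
The Experiment Setup row can likewise be captured as a configuration sketch. The values below mirror the hyperparameters quoted above; the names CaptioningConfig and lr_at_epoch are illustrative rather than taken from the released code, and the exact epoch at which the first post-warm-up decay applies is one reasonable reading of the quoted schedule.

```python
# Hedged sketch of the training configuration described in the paper.
# Class and function names are illustrative, not from the ER-SAN repository.
from dataclasses import dataclass

@dataclass
class CaptioningConfig:
    num_layers: int = 4        # baseline L = 4 for both encoder and decoder
    num_heads: int = 8
    hidden_dim: int = 512
    ffn_dim: int = 2048
    dropout: float = 0.1
    batch_size: int = 10
    xe_epochs: int = 18        # cross-entropy stage
    scst_epochs: int = 40      # CIDEr-reward stage
    scst_lr: float = 5e-6      # fixed learning rate during CIDEr optimization
    beam_size: int = 2

def lr_at_epoch(epoch: int, base_lr: float = 3e-4, warmup: int = 3,
                decay_every: int = 3, decay_rate: float = 0.5) -> float:
    """Cross-entropy-stage schedule: linear warm-up to base_lr over the
    first `warmup` epochs, then decay by `decay_rate` every `decay_every`
    epochs (one reasonable reading of the quoted setup)."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return base_lr * decay_rate ** ((epoch - warmup) // decay_every)
```

With this reading, epochs 0-2 warm up linearly to 3e-4, epochs 3-5 train at 3e-4, epochs 6-8 at 1.5e-4, and so on, until training switches to CIDEr optimization at epoch 18 with the fixed 5e-6 rate.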