ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning
Authors: Jingyu Li, Zhendong Mao, Shancheng Fang, Hao Li
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on MS-COCO dataset demonstrate the effectiveness of our ER-SAN, with improvements of CIDEr from 128.6% to 135.3%, achieving state-of-the-art performance. |
| Researcher Affiliation | Academia | 1 University of Science and Technology of China, Hefei, China; 2 Huazhong University of Science and Technology, Wuhan, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Codes will be released at https://github.com/CrossmodalGroup/ER-SAN. |
| Open Datasets | Yes | To validate our proposed framework, we conduct extensive experiments on the MS-COCO [Lin et al., 2014] which is the most commonly used dataset for image captioning. |
| Dataset Splits | Yes | According to the Karpathy splits [Karpathy and Fei-Fei, 2015], 5,000 images are used for validation, 5,000 images for testing, and 113,287 images for training. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions using "Adam optimizer" and refers to "Transformer-based model", but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | Following the Transformer-based model [Vaswani et al., 2017] and [Guo et al., 2020], our encoder and decoder have the same number of layers L, the number of heads is 8, the hidden dimension is 512, the inner dimension of the feed-forward module is 2048, and dropout = 0.1. We use the Adam optimizer with a mini-batch size of 10 to train our model. For cross-entropy training, we increase the learning rate linearly to 3e-4 with warm-up for 3 epochs, and then decay it by a rate of 0.5 every 3 epochs. We first train the model for 18 epochs with the cross-entropy loss and then further optimize with CIDEr reward for an additional 40 epochs with a fixed learning rate of 5e-6. The beam search size is 2 when generating captions during testing. Unless otherwise specified, we set the baseline transformer model layers to L = 4. |
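
To make the reported training setup concrete, the sketch below collects the stated hyperparameters into a configuration and reproduces the described learning-rate schedule. This is a minimal illustration only: the dictionary keys, the `xe_learning_rate` helper, the epoch-level interpretation of the warm-up, and the commented-out model/optimizer lines are assumptions for illustration, not the authors' released code (which is linked in the Open Source Code row).

```python
# Sketch of the reported ER-SAN training configuration (names are illustrative
# assumptions; values are taken from the Experiment Setup row above).

CONFIG = {
    "num_layers": 4,       # baseline transformer layers L (encoder = decoder)
    "num_heads": 8,
    "hidden_dim": 512,
    "ffn_dim": 2048,
    "dropout": 0.1,
    "batch_size": 10,
    "xe_epochs": 18,       # cross-entropy pre-training
    "cider_epochs": 40,    # CIDEr-reward fine-tuning
    "xe_peak_lr": 3e-4,
    "warmup_epochs": 3,
    "decay_rate": 0.5,
    "decay_every": 3,      # decay interval in epochs
    "cider_lr": 5e-6,      # fixed learning rate during CIDEr optimization
    "beam_size": 2,        # beam search size at test time
}


def xe_learning_rate(epoch: int) -> float:
    """Cross-entropy learning rate as described: linear warm-up to 3e-4 over
    3 epochs, then decay by a factor of 0.5 every 3 epochs (assumed to be
    applied at epoch granularity)."""
    if epoch < CONFIG["warmup_epochs"]:
        return CONFIG["xe_peak_lr"] * (epoch + 1) / CONFIG["warmup_epochs"]
    decay_steps = (epoch - CONFIG["warmup_epochs"]) // CONFIG["decay_every"]
    return CONFIG["xe_peak_lr"] * (CONFIG["decay_rate"] ** decay_steps)


if __name__ == "__main__":
    # model = build_er_san(CONFIG)                                   # hypothetical constructor
    # optimizer = torch.optim.Adam(model.parameters(),
    #                              lr=xe_learning_rate(0))           # Adam, as stated in the paper
    for epoch in range(CONFIG["xe_epochs"]):
        print(f"XE epoch {epoch:2d}: lr = {xe_learning_rate(epoch):.2e}")
    print(f"CIDEr fine-tuning: fixed lr = {CONFIG['cider_lr']:.0e} "
          f"for {CONFIG['cider_epochs']} epochs, beam size {CONFIG['beam_size']}")
```

Running the script prints the per-epoch cross-entropy learning rates implied by the stated warm-up and decay schedule, followed by the fixed-rate CIDEr fine-tuning phase; whether the warm-up is applied per epoch or per step is not specified in the paper and is an assumption here.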