ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning

Authors: Jingyu Li, Zhendong Mao, Shancheng Fang, Hao Li

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our ER-SAN, with improvements of CIDEr from 128.6% to 135.3%, achieving state-of-the-art performance.
Researcher Affiliation | Academia | ¹University of Science and Technology of China, Hefei, China; ²Huazhong University of Science and Technology, Wuhan, China
Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Codes will be released at https://github.com/CrossmodalGroup/ER-SAN.
Open Datasets | Yes | To validate our proposed framework, we conduct extensive experiments on MS-COCO [Lin et al., 2014], which is the most commonly used dataset for image captioning.
Dataset Splits | Yes | According to the Karpathy splits [Karpathy and Fei-Fei, 2015], 5,000 images are used for validation, 5,000 for testing, and 113,287 for training (a minimal split-loading sketch follows the table).
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions using the "Adam optimizer" and refers to a "Transformer-based model", but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | Following the Transformer-based model [Vaswani et al., 2017] and [Guo et al., 2020], our encoder and decoder have the same number of layers L, the number of heads is 8, the hidden dimension is 512, the inner dimension of the feed-forward module is 2048, and dropout = 0.1. We use the Adam optimizer with a mini-batch size of 10 to train our model. For cross-entropy training, we increase the learning rate linearly to 3e-4 with warm-up for 3 epochs, and then decay it by a rate of 0.5 every 3 epochs. We first train the model for 18 epochs with the cross-entropy loss and then further optimize with the CIDEr reward for an additional 40 epochs with a fixed learning rate of 5e-6. The beam search size is 2 when generating captions during testing. Unless otherwise specified, we set the baseline transformer model layers L = 4. (A hedged configuration sketch follows the table.)
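
The Karpathy split quoted in the Dataset Splits row is conventionally read from a single split file. The sketch below is a minimal, assumed implementation: the file name dataset_coco.json and the merging of the "restval" portion into the training set are conventions of the publicly distributed Karpathy split, not details stated in the paper.

```python
# Minimal sketch of partitioning MS-COCO images by the Karpathy split.
# Assumes the standard "dataset_coco.json" split file; the paper does not
# specify file names or loading code.
import json
from collections import defaultdict

def load_karpathy_splits(path="dataset_coco.json"):
    """Group image file names by their Karpathy split label."""
    with open(path) as f:
        data = json.load(f)

    splits = defaultdict(list)
    for img in data["images"]:
        # "restval" images are conventionally merged into the training set,
        # which yields the 113,287 / 5,000 / 5,000 partition quoted above.
        split = "train" if img["split"] == "restval" else img["split"]
        splits[split].append(img["filename"])
    return splits

if __name__ == "__main__":
    splits = load_karpathy_splits()
    for name in ("train", "val", "test"):
        print(name, len(splits[name]))  # expected: 113287, 5000, 5000
```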
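
The Experiment Setup row can likewise be captured as a configuration sketch. The values below mirror the hyperparameters quoted above; the names CaptioningConfig and lr_at_epoch are illustrative rather than taken from the released code, and the exact epoch at which the first post-warm-up decay applies is one reasonable reading of the quoted schedule.

```python
# Hedged sketch of the training configuration described in the paper.
# Class and function names are illustrative, not from the ER-SAN repository.
from dataclasses import dataclass

@dataclass
class CaptioningConfig:
    num_layers: int = 4        # baseline L = 4 for both encoder and decoder
    num_heads: int = 8
    hidden_dim: int = 512
    ffn_dim: int = 2048
    dropout: float = 0.1
    batch_size: int = 10
    xe_epochs: int = 18        # cross-entropy stage
    scst_epochs: int = 40      # CIDEr-reward stage
    scst_lr: float = 5e-6      # fixed learning rate during CIDEr optimization
    beam_size: int = 2

def lr_at_epoch(epoch: int, base_lr: float = 3e-4, warmup: int = 3,
                decay_every: int = 3, decay_rate: float = 0.5) -> float:
    """Cross-entropy-stage schedule: linear warm-up to base_lr over the
    first `warmup` epochs, then decay by `decay_rate` every `decay_every`
    epochs (one reasonable reading of the quoted setup)."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return base_lr * decay_rate ** ((epoch - warmup) // decay_every)
```

With this reading, epochs 0-2 warm up linearly to 3e-4, epochs 3-5 train at 3e-4, epochs 6-8 at 1.5e-4, and so on, until training switches to CIDEr optimization at epoch 18 with the fixed 5e-6 rate.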