S2 Transformer for Image Captioning

Authors: Pengpeng Zeng, Haonan Zhang, Jingkuan Song, Lianli Gao

Venue: IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the MSCOCO benchmark demonstrate that our method achieves a new state-of-the-art performance without bringing excessive parameters compared with the vanilla transformer.
Researcher Affiliation | Academia | School of Computer Science and Engineering and Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Chengdu, China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code is available at https://github.com/zchoi/S2Transformer.
Open Datasets | Yes | We conduct experiments to verify the effectiveness of our proposed S2 Transformer on the commonly-used image captioning dataset, i.e., MS-COCO.
Dataset Splits | Yes | In offline testing, we follow the setting in [Karpathy and Fei-Fei, 2015], where 113,287 images, 5,000 images, and 5,000 images are used as the train, validation, and test sets, respectively. (See the split-loading sketch below.)
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | In practice, our encoder and decoder both have 3 layers, where each layer uses 8 self-attention heads and the inner dimension of FFN is 2,048. The number of cluster centers N is 5 and the hyper-parameter λ = 0.2 in Eq. 9. We employ the Adam optimizer to train all models and set the batch size as 50. For cross-entropy (CE) training, we set the minimum epoch as 15... For self-critical sequence training, the learning rate is fixed at 5e-7. (See the configuration sketch below.)
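For reference, below is a minimal sketch of how the Karpathy split cited in the Dataset Splits row is typically materialized. The file name dataset_coco.json and its field layout are assumptions based on the split annotations commonly distributed for [Karpathy and Fei-Fei, 2015]; the paper itself does not describe the loading code.

```python
# Minimal sketch (not the authors' code): load the Karpathy split for MS-COCO.
# Assumes the commonly distributed "dataset_coco.json" annotation file.
import json
from collections import defaultdict

def load_karpathy_split(path="dataset_coco.json"):
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data["images"]:
        # "restval" images are merged into training, which yields the
        # 113,287 / 5,000 / 5,000 train/val/test partition used in the paper.
        split = "train" if img["split"] in ("train", "restval") else img["split"]
        splits[split].append(img)
    return splits

if __name__ == "__main__":
    splits = load_karpathy_split()
    for name in ("train", "val", "test"):
        print(name, len(splits[name]))  # expected: 113287, 5000, 5000
```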
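The Experiment Setup row can likewise be collected into a single configuration object. The class and attribute names below are illustrative and do not come from the authors' released repository; only the numerical values are taken from the quoted text.

```python
# Hedged sketch of the training configuration reported in the paper.
# Names are illustrative; values follow the Experiment Setup row above.
from dataclasses import dataclass

@dataclass
class S2TransformerConfig:
    num_encoder_layers: int = 3    # encoder depth
    num_decoder_layers: int = 3    # decoder depth
    num_attention_heads: int = 8   # self-attention heads per layer
    ffn_dim: int = 2048            # inner dimension of the FFN
    num_clusters: int = 5          # number of cluster centers N
    lam: float = 0.2               # hyper-parameter lambda in Eq. 9
    batch_size: int = 50
    xe_min_epochs: int = 15        # minimum epochs for cross-entropy training
    scst_lr: float = 5e-7          # fixed LR for self-critical sequence training

config = S2TransformerConfig()
print(config)
```

In practice these values would be passed to the training script (the paper reports Adam as the optimizer for all models); the dataclass is simply a compact way to list the reported settings in one place.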