S2 Transformer for Image Captioning
Authors: Pengpeng Zeng, Haonan Zhang, Jingkuan Song, Lianli Gao
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the MSCOCO benchmark demonstrate that our method achieves new state-of-the-art performance without bringing excessive parameters compared with the vanilla transformer. |
| Researcher Affiliation | Academia | School of Computer Science and Engineering and Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Chengdu, China. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/zchoi/S2Transformer. |
| Open Datasets | Yes | We conduct experiments to verify the effectiveness of our proposed S2 Transformer on the commonly-used image captioning dataset, i.e., MS-COCO. |
| Dataset Splits | Yes | In offline testing, we follow the setting in [Karpathy and Fei-Fei, 2015], where 113,287 images, 5,000 images, and 5,000 images are used as train, validation, and test set, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | In practice, our encoder and decoder both have 3 layers, where each layer uses 8 self-attention heads and the inner dimension of FFN is 2,048. The number of cluster centers N is 5 and the hyper-parameter λ = 0.2 in Eq. 9. We employ Adam optimizer to train all models and set batch size as 50. For cross-entropy (CE) training, we set the minimum epoch as 15... For self-critical sequence training, the learning rate is fixed at 5e-7. |
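
For readers attempting to reproduce the setup, the hyperparameters quoted in the Dataset Splits and Experiment Setup rows can be collected into a single configuration object. Below is a minimal sketch; the class name `S2Config` and field names are illustrative assumptions, not identifiers from the authors' released code at https://github.com/zchoi/S2Transformer, which may organize these values differently.

```python
from dataclasses import dataclass

@dataclass
class S2Config:
    """Hypothetical config mirroring the values quoted from the paper."""
    # Architecture
    num_encoder_layers: int = 3
    num_decoder_layers: int = 3
    num_attention_heads: int = 8      # per encoder/decoder layer
    ffn_inner_dim: int = 2048
    num_cluster_centers: int = 5      # N in the paper
    lambda_weight: float = 0.2        # λ in the paper's Eq. 9

    # Optimization (Adam optimizer in both stages)
    batch_size: int = 50
    min_ce_epochs: int = 15           # cross-entropy (CE) training stage
    scst_learning_rate: float = 5e-7  # self-critical sequence training stage

    # Karpathy split of MS-COCO used for offline testing
    num_train_images: int = 113_287
    num_val_images: int = 5_000
    num_test_images: int = 5_000
```

Note that the CE learning-rate schedule is elided in the excerpt ("...") and is therefore not represented above; consult the paper or repository for the full schedule.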