Image Captioning: Transforming Objects into Words
Authors: Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset. Code is available at https://github.com/yahoo/object_relation_transformer. |
| Researcher Affiliation | Industry | Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares Yahoo Research San Francisco, CA, 94103 {sherdade,kaboakye,jvbsoares}@verizonmedia.com, akappeler@apple.com |
| Pseudocode | No | The paper describes the algorithm using mathematical equations and text, but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/yahoo/object_relation_transformer. |
| Open Datasets | Yes | We trained and evaluated our algorithm on the Microsoft COCO (MS-COCO) 2014 Captions dataset [14]. The dataset contains 113K training images with 5 human annotated captions for each image. |
| Dataset Splits | Yes | We report results on the Karpathy validation and test splits [11], which are commonly used in other image captioning publications. The dataset contains 113K training images with 5 human annotated captions for each image. The Karpathy test and validation sets contain 5K images each. |
| Hardware Specification | Yes | We ran our experiments on NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper states "Our algorithm was developed in PyTorch" but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Our best performing model was pre-trained for 30 epochs with a softmax cross-entropy loss using the ADAM optimizer with learning rate defined as in the original Transformer paper, with 20000 warmup steps, and a batch size of 10. We trained for an additional 30 epochs using self-critical reinforcement learning [21] optimizing for CIDEr-D score, and did early-stopping for best performance on the validation set (which contains 5000 images). The models compared in sections 5.3-5.6 are evaluated after training for 30 epochs with standard cross-entropy loss, using ADAM optimization with the above learning rate schedule, and with batch size 15. |
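The quoted setup uses the learning-rate schedule "defined as in the original Transformer paper" with 20000 warmup steps. As a minimal sketch of that schedule (the model dimension of 512 is an assumption from the standard Transformer base configuration; the paper excerpt does not state it):

```python
import math


def transformer_lr(step: int, d_model: int = 512, warmup: int = 20000) -> float:
    """Learning-rate schedule from "Attention Is All You Need":
    linear warmup for `warmup` steps, then inverse-square-root decay.

    NOTE: d_model=512 is an assumed value, not stated in the excerpt above.
    """
    step = max(step, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The learning rate rises linearly to its peak at step == warmup,
# then decays proportionally to 1/sqrt(step).
peak_lr = transformer_lr(20000)
```

The schedule is typically applied per optimization step (not per epoch), multiplying the base ADAM learning rate at each update.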