Object Relation Attention for Image Paragraph Captioning
Authors: Li-Chuan Yang, Chih-Yuan Yang, Jane Yung-jen Hsu
AAAI 2021, pp. 3136-3144 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the proposed network extracts effective object features for image paragraph captioning and achieves promising performance against existing methods. We evaluate the proposed method on the Stanford paragraph dataset (Krause et al. 2017), which contains 19551 image/paragraph pairs, split into training/validation/test sets containing 14575/2487/2489 pairs, respectively. We use 6 metrics CIDEr (Vedantam, Zitnick, and Parikh 2015), METEOR (Banerjee and Lavie 2005), BLEU-1, BLEU-2, BLEU-3, and BLEU-4 (Papineni et al. 2002) as the literature (Krause et al. 2017; Liang et al. 2017; Chatterjee and Schwing 2018; Melas-Kyriazi, Rush, and Han 2018). *(A metric-scoring sketch appears after the table.)* |
| Researcher Affiliation | Academia | Li-Chuan Yang,¹ Chih-Yuan Yang,¹,² and Jane Yung-jen Hsu¹,²; ¹Computer Science and Information Engineering, National Taiwan University; ²NTU IoX Center, National Taiwan University; {r07922100,yangchihyuan,yjhsu}@ntu.edu.tw |
| Pseudocode | No | No explicit pseudocode or algorithm blocks found. The paper describes the architecture and steps in text and flowcharts but not as structured pseudocode. |
| Open Source Code | No | No statement about releasing their own source code or link to a repository for the described methodology. The acknowledgement section mentions appreciation for 'open-source implementations' by others, not their own. |
| Open Datasets | Yes | We evaluate the proposed method on the Stanford paragraph dataset (Krause et al. 2017), which contains 19551 image/paragraph pairs, split into training/validation/test sets containing 14575/2487/2489 pairs, respectively. |
| Dataset Splits | Yes | We evaluate the proposed method on the Stanford paragraph dataset (Krause et al. 2017), which contains 19551 image/paragraph pairs, split into training/validation/test sets containing 14575/2487/2489 pairs, respectively. *(A split-loading sketch appears after the table.)* |
| Hardware Specification | Yes | We train our model on a machine equipped with a 3.7GHz 12-core CPU and an NVidia GPU GTX 1080Ti. |
| Software Dependencies | No | The paper mentions using a "publicly available Faster R-CNN implementation" and the "Adam optimizer" but does not provide specific version numbers for any software, libraries, or dependencies. |
| Experiment Setup | Yes | To train our models, we use the Adam optimizer with a learning rate initialized as 5×10⁻⁴ and decaying 20% every two epochs. We manually set the attention hyperparameter c as 2 because we find the proposed method converges well and performs stably when the value is between 1 and 3. We set the training batch size as 10. The configuration of overlapping objects and asymmetric features consumes 2.3 GB GPU memory and takes 16 hours to run 80 epochs, including the first 30 cross-entropy epochs and the following 50 SCST epochs. *(A hedged optimizer/scheduler sketch follows the table.)* |
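For concreteness, here is a minimal sketch of loading the Stanford paragraph dataset with the reported 14575/2487/2489 split. The file names (`paragraphs_v1.json`, `train_split.json`, etc.) follow the dataset's public release and are assumptions; the paper does not specify them.

```python
import json

# Hypothetical file names from the public Stanford paragraph dataset
# release (Krause et al. 2017); the paper itself does not name them.
with open("paragraphs_v1.json") as f:
    paragraphs = json.load(f)  # list of {"image_id", "paragraph", ...}

splits = {}
for name in ("train", "val", "test"):
    with open(f"{name}_split.json") as f:
        ids = set(json.load(f))  # list of image ids in this split
    splits[name] = [p for p in paragraphs if p["image_id"] in ids]

# Expected sizes per the paper: 14575 / 2487 / 2489
print({name: len(v) for name, v in splits.items()})
```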
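All six reported metrics are available in the `pycocoevalcap` package, so scoring generated paragraphs against references might look like the sketch below. The package choice and the id-to-list-of-strings input format are assumptions; the paper does not say which scoring implementation it used.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor

# Both mappings go from an image id to a list of tokenized strings.
refs = {"img1": ["a man rides a horse on the beach ."]}      # ground truth
hyps = {"img1": ["a man is riding a horse near the sea ."]}  # generated

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)   # BLEU-1..BLEU-4
cider_score, _ = Cider().compute_score(refs, hyps)
meteor_score, _ = Meteor().compute_score(refs, hyps)  # requires Java

print(dict(zip(["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"], bleu_scores)))
print({"CIDEr": cider_score, "METEOR": meteor_score})
```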
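In PyTorch terms, the quoted schedule (Adam, initial learning rate 5×10⁻⁴, 20% decay every two epochs, 30 cross-entropy epochs followed by 50 SCST epochs) corresponds to something like the sketch below. The model is a placeholder, and PyTorch itself is an assumption, since the paper names no framework.

```python
import torch

model = torch.nn.Linear(2048, 512)  # placeholder for the captioning model

# Adam with the paper's initial learning rate of 5e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# "Decaying 20% every two epochs": multiply the lr by 0.8 every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.8)

for epoch in range(80):  # 30 cross-entropy epochs, then 50 SCST epochs
    use_scst = epoch >= 30  # switch to self-critical sequence training
    # ... iterate over batches of size 10 and call optimizer.step() ...
    scheduler.step()
```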