Object Relation Attention for Image Paragraph Captioning

Authors: Li-Chuan Yang, Chih-Yuan Yang, Jane Yung-jen Hsu (pp. 3136-3144)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the proposed network extracts effective object features for image paragraph captioning and achieves promising performance against existing methods. We evaluate the proposed method on the Stanford paragraph dataset (Krause et al. 2017), which contains 19551 image/paragraph pairs, split into training/validation/test sets containing 14575/2487/2489 pairs, respectively. We use 6 metrics CIDEr (Vedantam, Zitnick, and Parikh 2015), METEOR (Banerjee and Lavie 2005), BLEU-1, BLEU-2, BLEU-3, and BLEU-4 (Papineni et al. 2002) as the literature (Krause et al. 2017; Liang et al. 2017; Chatterjee and Schwing 2018; Melas-Kyriazi, Rush, and Han 2018). (A metric-computation sketch follows after this table.)
Researcher Affiliation | Academia | Li-Chuan Yang (1), Chih-Yuan Yang (1,2), and Jane Yung-jen Hsu (1,2); (1) Computer Science and Information Engineering, National Taiwan University; (2) NTU IoX Center, National Taiwan University; {r07922100,yangchihyuan,yjhsu}@ntu.edu.tw
Pseudocode | No | No explicit pseudocode or algorithm blocks found. The paper describes the architecture and steps in text and flowcharts, but not as structured pseudocode.
Open Source Code | No | The paper contains no statement about releasing the authors' own source code and no link to a repository for the described methodology. The acknowledgement section expresses appreciation for 'open-source implementations' by others, not for the authors' own release.
Open Datasets | Yes | We evaluate the proposed method on the Stanford paragraph dataset (Krause et al. 2017), which contains 19551 image/paragraph pairs, split into training/validation/test sets containing 14575/2487/2489 pairs, respectively.
Dataset Splits | Yes | We evaluate the proposed method on the Stanford paragraph dataset (Krause et al. 2017), which contains 19551 image/paragraph pairs, split into training/validation/test sets containing 14575/2487/2489 pairs, respectively. (A split-check sketch follows after this table.)
Hardware Specification | Yes | We train our model on a machine equipped with a 3.7 GHz 12-core CPU and an NVIDIA GTX 1080 Ti GPU.
Software Dependencies | No | The paper mentions using a "publicly available Faster R-CNN implementation" and the "Adam optimizer" but does not provide specific version numbers for any software, libraries, or dependencies.
Experiment Setup | Yes | To train our models, we use the Adam optimizer with a learning rate initialized as 5 × 10⁻⁴ and decaying 20% every two epochs. We manually set the attention hyperparameter c as 2 because we find the proposed method converges well and performs stably when the value is between 1 and 3. We set the training batch size as 10. The configuration of overlapping objects and asymmetric features consumes 2.3 GB of GPU memory and takes 16 hours to run 80 epochs, including the first 30 cross-entropy epochs and the following 50 SCST epochs. (A training-schedule sketch follows below.)
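
The Research Type row above lists the six evaluation metrics (BLEU-1 through BLEU-4, METEOR, and CIDEr). The paper does not name its evaluation toolkit, so the following is a minimal sketch assuming the commonly used pycocoevalcap package, where gts and res map each image id to a list of reference and candidate paragraphs.

    # Sketch only: pycocoevalcap is an assumption, not named in the paper.
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.cider.cider import Cider
    from pycocoevalcap.meteor.meteor import Meteor

    def evaluate(gts, res):
        """Score candidate paragraphs `res` against reference paragraphs `gts`."""
        scores = {}
        bleu, _ = Bleu(4).compute_score(gts, res)                # BLEU-1..4
        scores.update({f"BLEU-{i + 1}": b for i, b in enumerate(bleu)})
        scores["METEOR"], _ = Meteor().compute_score(gts, res)   # needs a Java runtime
        scores["CIDEr"], _ = Cider().compute_score(gts, res)
        return scores

    # Toy usage with a single image id and pre-tokenized, lowercased text.
    gts = {"1": ["a man rides a brown horse along the beach ."]}
    res = {"1": ["a man is riding a horse on the beach ."]}
    print(evaluate(gts, res))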
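
For the Open Datasets and Dataset Splits rows, the sketch below checks the reported 14575/2487/2489 split sizes. The file names (train_split.json, val_split.json, test_split.json) follow the split files distributed with the Stanford paragraph dataset of Krause et al. (2017) and are an assumption here, not quoted from the paper.

    # Sketch only: verify the Stanford paragraph dataset split sizes.
    import json

    def load_ids(path):
        # Each split file is assumed to hold a JSON list of image ids.
        with open(path) as f:
            return set(json.load(f))

    train_ids = load_ids("train_split.json")
    val_ids = load_ids("val_split.json")
    test_ids = load_ids("test_split.json")

    # Expected sizes from the paper: 14575 / 2487 / 2489 (19551 in total).
    print(len(train_ids), len(val_ids), len(test_ids))
    assert len(train_ids) + len(val_ids) + len(test_ids) == 19551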
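
For the Experiment Setup row, the sketch below mirrors the quoted optimization schedule: Adam with a learning rate of 5 × 10⁻⁴ decaying 20% every two epochs, 30 cross-entropy epochs followed by 50 SCST epochs, and a data loader built with batch size 10. PyTorch is assumed (the paper does not name its framework), and train_xe_epoch / train_scst_epoch are hypothetical placeholders for the authors' training loops.

    # Sketch only: PyTorch and the per-epoch functions are assumptions.
    import torch

    def train_xe_epoch(model, loader, optimizer):
        """Placeholder for one epoch of word-level cross-entropy training."""

    def train_scst_epoch(model, loader, optimizer):
        """Placeholder for one epoch of SCST training with a CIDEr reward."""

    def train(model, loader):  # loader assumed to use batch_size=10
        optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
        # "decaying 20% every two epochs" -> multiply the lr by 0.8 every 2 epochs.
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.8)
        for epoch in range(80):  # 80 epochs total, as reported
            if epoch < 30:
                train_xe_epoch(model, loader, optimizer)
            else:
                train_scst_epoch(model, loader, optimizer)
            scheduler.step()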