Consensus Graph Representation Learning for Better Grounded Image Captioning

Authors: Wenqiao Zhang, Haochen Shi, Siliang Tang, Jun Xiao, Qiang Yu, Yueting Zhuang

AAAI 2021, pp. 3394-3402

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. Besides, our CGRL is also evaluated by several automatic metrics and human evaluation; the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 CIDEr) and grounding (+2.3 F1_loc).
Researcher Affiliation | Collaboration | Wenqiao Zhang1, Haochen Shi1, Siliang Tang1, Jun Xiao1, Qiang Yu2, Yueting Zhuang1 (1 Zhejiang University, 2 Citycloud Technology)
Pseudocode | Yes | Algorithm 1 (see appendix) details the pseudocode of our CGRL algorithm for GIC.
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor does it include a direct link to a code repository for its methodology.
Open Datasets | Yes | We benchmark our approach for GIC on the Flickr30k Entities dataset and compare our CGRL method to the state-of-the-art models. Moreover, Flickr30k Entities contains 31k images with 275k bounding boxes associated with natural language phrases. [...] 1https://hockenmaier.cs.illinois.edu/DenotationGraph/
Dataset Splits | Yes | There are 29k images for training, 1k images for validation, and another 1k images for testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running its experiments.
Software Dependencies | No | The paper mentions models and networks (e.g., Faster R-CNN, ResNeXt-101, GCN) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For image captioning, we tokenized the texts on whitespace, and sentences are cut at a maximum length of 20 words. All Arabic numerals are converted to English words, and an Unknown token replaces words outside the vocabulary list. The vocabulary has 7,000 words, each word is represented by a 512-dimensional vector, and the RNN encoding size is m = 1024.
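
For reference, the quoted setup can be expressed as a minimal PyTorch sketch. The NUM2WORD map, the tokenize/encode helpers, and the choice of a GRU cell are illustrative assumptions, not details confirmed by the paper; only the sizes (20-word cap, 7,000-word vocabulary, 512-dimensional embeddings, RNN size 1024) come from the quoted text:

```python
import torch
import torch.nn as nn

MAX_LEN = 20        # sentences are cut at a maximum of 20 words
VOCAB_SIZE = 7000   # vocabulary of 7,000 words
EMBED_DIM = 512     # each word is a 512-dimensional vector
RNN_SIZE = 1024     # RNN encoding size m = 1024

# Hypothetical single-digit map; the paper only states that Arabic
# numerals are converted to English words.
NUM2WORD = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
            "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def tokenize(sentence):
    """Whitespace tokenization, numeral conversion, and truncation."""
    tokens = [NUM2WORD.get(tok, tok) for tok in sentence.lower().split()]
    return tokens[:MAX_LEN]

def encode(tokens, word2idx):
    """Map tokens to indices, replacing out-of-vocabulary words with <unk>."""
    unk = word2idx["<unk>"]
    return torch.tensor([word2idx.get(t, unk) for t in tokens])

# Embedding table and RNN encoder matching the stated sizes
# (a GRU stands in here; the exact RNN cell is not given in the quote).
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
encoder = nn.GRU(EMBED_DIM, RNN_SIZE, batch_first=True)

ids = encode(tokenize("2 dogs run across a field"), {"<unk>": 0})
out, _ = encoder(embedding(ids).unsqueeze(0))  # shape: (1, seq_len, 1024)
```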