Consensus Graph Representation Learning for Better Grounded Image Captioning

Authors: Wenqiao Zhang, Haochen Shi, Siliang Tang, Jun Xiao, Qiang Yu, Yueting Zhuang

AAAI 2021, pp. 3394-3402

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. Besides, our CGRL is also evaluated by several automatic metrics and human evaluation; the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 CIDEr) and grounding (+2.3 F1_loc).
Researcher Affiliation | Collaboration | Wenqiao Zhang1, Haochen Shi1, Siliang Tang1, Jun Xiao1, Qiang Yu2, Yueting Zhuang1 (1 Zhejiang University, 2 Citycloud Technology)
Pseudocode | Yes | Algorithm 1 (see appendix) details the pseudocode of our CGRL algorithm for GIC.
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor does it include a direct link to a code repository for its methodology.
Open Datasets | Yes | We benchmark our approach for GIC on the Flickr30k Entities dataset and compare our CGRL method to the state-of-the-art models. Moreover, Flickr30k Entities contains 31k images with 275k bounding boxes associated with natural language phrases. [...] 1https://hockenmaier.cs.illinois.edu/DenotationGraph/
Dataset Splits | Yes | There are 29k images for training, 1k images for validation, and another 1k images for testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running its experiments.
Software Dependencies | No | The paper mentions models and networks (e.g., Faster R-CNN, ResNeXt-101, GCN) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For image captioning, we tokenized the texts on whitespace, and sentences are cut at a maximum length of 20 words. All Arabic numerals are converted to English words, and an Unknown token replaces words outside the vocabulary list. The vocabulary has 7,000 words, each word is represented by a 512-dimensional vector, and the RNN encoding size is m = 1024.
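
For reference, the quoted setup can be expressed as a minimal PyTorch sketch. The NUM2WORD map, the tokenize/encode helpers, and the choice of a GRU cell are illustrative assumptions, not details confirmed by the paper; only the sizes (20-word cap, 7,000-word vocabulary, 512-dimensional embeddings, RNN size 1024) come from the quoted text:

```python
import torch
import torch.nn as nn

MAX_LEN = 20        # sentences are cut at a maximum of 20 words
VOCAB_SIZE = 7000   # vocabulary of 7,000 words
EMBED_DIM = 512     # each word is a 512-dimensional vector
RNN_SIZE = 1024     # RNN encoding size m = 1024

# Hypothetical single-digit map; the paper only states that Arabic
# numerals are converted to English words.
NUM2WORD = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
            "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def tokenize(sentence):
    """Whitespace tokenization, numeral conversion, and truncation."""
    tokens = [NUM2WORD.get(tok, tok) for tok in sentence.lower().split()]
    return tokens[:MAX_LEN]

def encode(tokens, word2idx):
    """Map tokens to indices, replacing out-of-vocabulary words with <unk>."""
    unk = word2idx["<unk>"]
    return torch.tensor([word2idx.get(t, unk) for t in tokens])

# Embedding table and RNN encoder matching the stated sizes
# (a GRU stands in here; the exact RNN cell is not given in the quote).
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
encoder = nn.GRU(EMBED_DIM, RNN_SIZE, batch_first=True)

ids = encode(tokenize("2 dogs run across a field"), {"<unk>": 0})
out, _ = encoder(embedding(ids).unsqueeze(0))  # shape: (1, seq_len, 1024)
```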