Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Authors: Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, Michael Zeng
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the quality of generated descriptions by quantitative and qualitative measurements. The results demonstrate the effectiveness of such a structured semantic representation. In this section, we use SPICE to benchmark the accuracy and completeness of our framework. We evaluate our framework on the test set of the Stanford dataset (Krause et al., 2017). |
| Researcher Affiliation | Industry | Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, Michael Zeng Microsoft {yujiaxie, luowei.zhou, xiyang.dai, luyuan, nguyenbach, ce.liu, nzeng}@microsoft.com Currently at Google Brain. |
| Pseudocode | No | The paper describes the framework in steps but does not include a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper states 'Some of the models and data are proprietary.' in the reproducibility checklist, and provides a link to a baseline model ('Socratic model') but does not explicitly state that their own source code is open or provide a link to it. |
| Open Datasets | Yes | We evaluate our framework on the test set of the Stanford dataset (Krause et al., 2017). The dataset is a subset of the Visual Genome (VG) dataset (Krishna et al., 2017), and therefore we can obtain the human-annotated scene graphs from VG as well. BLIP-large (Li et al., 2022) is finetuned on the COCO captions dataset (Chen et al., 2015). We benchmark its performance on two Visual Question Answering (VQA) datasets: we use the GQA (Hudson and Manning, 2019) dataset for probing the capability of scene understanding, and the OK-VQA (Marino et al., 2019) dataset for awareness of commonsense knowledge. |
| Dataset Splits | No | The paper mentions using the 'test set of Stanford dataset' and 'training images similar to Stanford dataset', but does not explicitly provide the training/validation/test dataset splits with specific percentages or counts for their experimental setup. |
| Hardware Specification | No | The paper states 'The proposed method does not involve training.' and 'The overall usage of computing resources is not significant.', and does not specify any hardware details like GPU/CPU models. |
| Software Dependencies | No | The paper mentions specific models like GPT-3 Davinci-text-001, BLIP-large, Florence-H, and YOLOv5, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | We adopt temperature as 0.8, as a higher temperature encourages the model to have more creative outputs. We adopt the frequency penalty as 0.5 and the maximum number of tokens as 100. We adopt number of tags M = 5, thresholds β = γ = 0.2, and number of candidates K = 40. |
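
The Experiment Setup row quotes the decoding hyperparameters verbatim. As a point of reference only, below is a minimal sketch of how those settings could map onto the legacy OpenAI Completion API, assuming the "GPT-3 Davinci-text-001" model named in the paper corresponds to the `text-davinci-001` engine; the visual-clue prompt construction is not shown, and the function and variable names are illustrative rather than taken from the authors' (unreleased) code.

```python
# Hypothetical sketch of the quoted generation settings, assuming the legacy
# OpenAI Completion API. Not the authors' implementation.
import openai

GENERATION_CONFIG = {
    "engine": "text-davinci-001",   # "GPT-3 Davinci-text-001" as named in the paper (assumption)
    "temperature": 0.8,             # higher temperature encourages more creative outputs
    "frequency_penalty": 0.5,
    "max_tokens": 100,
}

# Prompt-side hyperparameters quoted in the paper; how they are applied to build
# the visual-clue prompt is not specified here.
NUM_TAGS_M = 5
THRESHOLD_BETA = 0.2
THRESHOLD_GAMMA = 0.2
NUM_CANDIDATES_K = 40


def generate_paragraph(prompt: str) -> str:
    """Generate one candidate paragraph caption from a prompt built out of visual clues."""
    response = openai.Completion.create(prompt=prompt, **GENERATION_CONFIG)
    return response["choices"][0]["text"].strip()
```

The tag count, thresholds, and candidate count are collected as constants only to keep the quoted values in one place; since the paper does not release code, their exact use in prompt construction and candidate selection remains unspecified.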