Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Authors: Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, Michael Zeng
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the quality of generated descriptions by quantitative and qualitative measurements. The results demonstrate the effectiveness of such a structured semantic representation. In this section, we use SPICE to benchmark the accuracy and completeness of our framework. We evaluate our framework on the test set of the Stanford dataset (Krause et al., 2017). |
| Researcher Affiliation | Industry | Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, Michael Zeng Microsoft {yujiaxie, luowei.zhou, xiyang.dai, luyuan, nguyenbach, ce.liu, nzeng}@microsoft.com Currently at Google Brain. |
| Pseudocode | No | The paper describes the framework in steps but does not include a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper states 'Some of the models and data are proprietary.' in the reproducibility checklist, and provides a link to a baseline model ('Socratic model') but does not explicitly state that their own source code is open or provide a link to it. |
| Open Datasets | Yes | We evaluate our framework on the test set of the Stanford dataset (Krause et al., 2017). The dataset is a subset of the Visual Genome (VG) dataset (Krishna et al., 2017), and therefore we can obtain the human-annotated scene graphs from VG as well. BLIP-large (Li et al., 2022) is finetuned on the COCO captions dataset (Chen et al., 2015). We benchmark its performance on two Visual Question Answering (VQA) datasets: we use the GQA (Hudson and Manning, 2019) dataset for probing the capability of scene understanding, and the OK-VQA (Marino et al., 2019) dataset for awareness of commonsense knowledge. |
| Dataset Splits | No | The paper mentions using the 'test set of Stanford dataset' and 'training images similar to Stanford dataset', but does not explicitly provide the training/validation/test dataset splits with specific percentages or counts for their experimental setup. |
| Hardware Specification | No | The paper states 'The proposed method does not involve training.' and 'The overall usage of computing resources is not significant.', and does not specify any hardware details like GPU/CPU models. |
| Software Dependencies | No | The paper mentions specific models like GPT-3 Davinci-text-001, BLIP-large, Florence-H, and YOLOv5, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | We adopt temperature as 0.8, as a higher temperature encourages the model to have more creative outputs. We adopt the frequency penalty as 0.5 and the maximum number of tokens as 100. We adopt number of tags M = 5, thresholds β = γ = 0.2, and number of candidates K = 40. |
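
The Experiment Setup row quotes the decoding hyperparameters verbatim. As a point of reference only, below is a minimal sketch of how those settings could map onto the legacy OpenAI Completion API, assuming the "GPT-3 Davinci-text-001" model named in the paper corresponds to the `text-davinci-001` engine; the visual-clue prompt construction is not shown, and the function and variable names are illustrative rather than taken from the authors' (unreleased) code.

```python
# Hypothetical sketch of the quoted generation settings, assuming the legacy
# OpenAI Completion API. Not the authors' implementation.
import openai

GENERATION_CONFIG = {
    "engine": "text-davinci-001",   # "GPT-3 Davinci-text-001" as named in the paper (assumption)
    "temperature": 0.8,             # higher temperature encourages more creative outputs
    "frequency_penalty": 0.5,
    "max_tokens": 100,
}

# Prompt-side hyperparameters quoted in the paper; how they are applied to build
# the visual-clue prompt is not specified here.
NUM_TAGS_M = 5
THRESHOLD_BETA = 0.2
THRESHOLD_GAMMA = 0.2
NUM_CANDIDATES_K = 40


def generate_paragraph(prompt: str) -> str:
    """Generate one candidate paragraph caption from a prompt built out of visual clues."""
    response = openai.Completion.create(prompt=prompt, **GENERATION_CONFIG)
    return response["choices"][0]["text"].strip()
```

The tag count, thresholds, and candidate count are collected as constants only to keep the quoted values in one place; since the paper does not release code, their exact use in prompt construction and candidate selection remains unspecified.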