Unveiling the Tapestry of Consistency in Large Vision-Language Models
Authors: Yuan Zhang, Fei Xiao, Tao Huang, Chun-Kai Fan, Hongyuan Dong, Jiawen Li, Jiacong Wang, Kuan Cheng, Shanghang Zhang, Haoyuan Guo
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we provide a multi-modal benchmark ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on the ConBench tool, we are the first to reveal the tapestry and get the following findings: (1) In the discriminate realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) Establish the relationship between the discriminative and generative realms: the accuracy of the discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of Consistency. Eventually, we ameliorate the consistency of LVLMs by trigger-based diagnostic refinement, indirectly improving the performance of their caption. We hope this paper will accelerate the research community in better evaluating their models and encourage future advancements in the consistency domain. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science, Peking University; 2 ByteDance Inc; 3 The University of Sydney; 4 School of Artificial Intelligence, UCAS |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 8 illustrates a pipeline, but it is not pseudocode. |
| Open Source Code | Yes | https://github.com/foundation-multimodal-models/ConBench |
| Open Datasets | Yes | We manually chose 1K images from four high-quality multimodal benchmarks: MME [10], SeedBench [12], MMBench [21], and MMMU [28]... We have uploaded the ConBench dataset, including images and their prompts, to the Hugging Face platform. The dataset can be accessed at the following URL: https://huggingface.co/datasets/ConBench/ConBench. |
| Dataset Splits | No | The paper evaluates existing LVLMs on a newly created benchmark (ConBench) but does not provide details about training/validation/test splits of a dataset used to *train* the models themselves. It describes the structure of the benchmark data (e.g., true/false questions have 50% distribution), which pertains to the evaluation set, not model training splits. |
| Hardware Specification | Yes | The evaluation in our paper only needs an A100-80GB GPU. |
| Software Dependencies | No | The paper mentions using GPT/GPT-4 for certain tasks (e.g., prompt generation, judging consistency), but it does not specify version numbers for programming languages, libraries, or other software components used for their own implementation or experiments. |
| Experiment Setup | Yes | When parsed result s > 0.4, we consider the answer to be exactly right. ... we set τ = 0.85 here ... We carried out experiments on the LLaVA-NeXT-34B and Mini-Gemini-34B and evaluated them on the metric [C] of ConBench. |
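
As a convenience for reproduction, the sketch below shows one way to fetch the released benchmark files referenced in the "Open Datasets" row. It is not taken from the paper or its repository; it only assumes that the `ConBench/ConBench` dataset repository on the Hugging Face Hub is public and uses the standard `huggingface_hub` client.

```python
# Hypothetical download helper; the only document-sourced fact here is the
# dataset repository id "ConBench/ConBench" quoted in the "Open Datasets" row.
from huggingface_hub import snapshot_download

# Mirror the raw benchmark files (images and prompts) to a local cache.
local_dir = snapshot_download(repo_id="ConBench/ConBench", repo_type="dataset")
print(f"ConBench files downloaded to: {local_dir}")
```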
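
The "Experiment Setup" row quotes two thresholds: a parsed-result cutoff of 0.4 for counting a discriminative answer as correct, and τ = 0.85 for the consistency judgement. The following is a minimal sketch of how such per-sample scores could be aggregated into accuracy and Consistency numbers; the `Record` layout and function names are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

S_THRESHOLD = 0.4  # parsed-result cutoff quoted in the "Experiment Setup" row
TAU = 0.85         # consistency threshold tau quoted in the same row

@dataclass
class Record:
    question_type: str         # e.g. "true_false", "multi_choice", "vqa"
    answer_score: float        # parsed result s for the discriminative answer
    caption_similarity: float  # judged agreement between answer and caption

def accuracy(records: List[Record], question_type: str) -> float:
    """Discriminative accuracy for one question type (answer_score > 0.4)."""
    subset = [r for r in records if r.question_type == question_type]
    if not subset:
        return 0.0
    return sum(r.answer_score > S_THRESHOLD for r in subset) / len(subset)

def consistency(records: List[Record], question_type: str) -> float:
    """Share of answers judged consistent with the caption (similarity >= tau)."""
    subset = [r for r in records if r.question_type == question_type]
    if not subset:
        return 0.0
    return sum(r.caption_similarity >= TAU for r in subset) / len(subset)
```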