Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

Authors: Mingrui Wu, Jiayi Ji, Oucheng Huang, Jiale Li, Yuhang Wu, Xiaoshuai Sun, Rongrong Ji

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and rely overly on the common-sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information. We assess various recently popular LVLMs on our R-Bench and present our findings as follows: 1) Relationship hallucinations in LVLMs are more severe than object hallucinations, mainly due to the long-tail distribution between relationships and objects in the training data. 2) Relationship hallucinations often emerge from the co-occurrence patterns among relationships, specifically relationship-relationship, subject-relationship, and relationship-object. 3) Employing fine-grained image-text alignment could potentially mitigate hallucinations. Additionally, our analysis of counterfactual and illusion relationship hallucinations, based on web-collected images, reveals: 4) Existing LVLMs often overlook visual content, relying on the LLM's common sense for predictions. 5) LVLMs struggle to reason about spatial relationships based on context.
Researcher Affiliation | Academia | Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Correspondence to: Jiayi Ji <jjyxmu@gmail.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link for open-sourcing the code for the described methodology or benchmark.
Open Datasets | Yes | For the benchmark, we employ a combination of automatic generation by the Large Language Model (LLM) and manual curation. To ensure the benchmark's integrity, it is based on the nocaps validation set (Agrawal et al., 2019), preventing overlap with the pre-trained data of LVLMs. Specifically, we first apply a scene graph parser (Li et al., 2023f) to extract relationship triplets from COCO captions (see the triplet-extraction sketch after this table).
Dataset Splits | No | The paper evaluates existing LVLMs on a newly proposed benchmark (R-Bench) derived from the nocaps validation set. It describes how evaluation subsets are created ("randomly select an equal number of both to create a subset for LVLMs evaluation" and "compute average scores of 5 random subsets"), but it does not specify traditional train/validation/test splits for training a model presented in the paper.
Hardware Specification | Yes | All models are evaluated on a single NVIDIA RTX 3090 GPU with 24 GB of memory.
Software Dependencies | No | The paper mentions specific models and components such as "Llama2-chat-13B" and "Grounding DINO" but does not provide version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | For each setting, we randomly sample 5 subsets with a 1:1 positive-negative question ratio and compute the average scores of these 5 subsets (see the subset-sampling sketch below). During inference, we perform low-bit quantization on the LLM to avoid running out of GPU memory (see the quantization sketch below).
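
The benchmark construction pipeline begins by extracting (subject, relationship, object) triplets from captions. Below is a minimal sketch of that step; `SceneGraphParser` and its `parse` method are hypothetical stand-ins for the parser of Li et al. (2023f), whose actual interface the paper does not specify.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relationship, object)

class SceneGraphParser:
    """Hypothetical stand-in for the scene graph parser of Li et al. (2023f)."""

    def parse(self, caption: str) -> List[Triplet]:
        # Replace with a call to the actual parser; e.g. the caption
        # "a dog lying on a couch" should yield ("dog", "lying on", "couch").
        raise NotImplementedError

def extract_triplets(captions: List[str], parser: SceneGraphParser) -> List[Triplet]:
    """Collect relationship triplets from a list of COCO-style captions."""
    triplets: List[Triplet] = []
    for caption in captions:
        triplets.extend(parser.parse(caption))
    return triplets
```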
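The evaluation protocol quoted in the Experiment Setup row (five random subsets, 1:1 positive-negative question ratio, scores averaged across subsets) can be reproduced with a short sampling routine. This is a sketch under those stated assumptions, not the authors' code; `score_fn` is a placeholder for whatever per-subset metric (e.g., accuracy) is computed.

```python
import random
from statistics import mean

def balanced_subsets(pos_questions, neg_questions, n_subsets=5, seed=0):
    """Draw n_subsets random subsets, each with a 1:1 positive:negative ratio."""
    rng = random.Random(seed)
    k = min(len(pos_questions), len(neg_questions))
    return [
        rng.sample(pos_questions, k) + rng.sample(neg_questions, k)
        for _ in range(n_subsets)
    ]

def averaged_score(subsets, score_fn):
    """Average a per-subset metric (e.g., accuracy) over the sampled subsets."""
    return mean(score_fn(subset) for subset in subsets)
```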
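The paper says only that the LLM is quantized to low-bit precision to fit on a single 24 GB RTX 3090; it does not name the library or bit width. As one plausible realization, the sketch below loads a Llama-2-13B-chat backbone in 4 bits via Hugging Face Transformers and bitsandbytes; the model ID and the 4-bit/NF4 settings are assumptions, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed settings: 4-bit NF4 quantization so a 13B LLM fits in 24 GB of VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint for Llama2-chat-13B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # requires `accelerate`; places quantized weights on the GPU
)
```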