Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
Authors: Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. |
| Researcher Affiliation | Academia | Junjie Wu, Tsz Ting Chung, Kai Chen (EMAIL), The Hong Kong University of Science and Technology; Dit-Yan Yeung (EMAIL), The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes the methods for knowledge graph extraction, hallucination judgment, and mitigation strategies using natural language and examples of prompts, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE. |
| Open Datasets | Yes | Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE. The construction of Tri-HE begins with images from the GQA dataset (Hudson & Manning, 2019), as the scene graph annotations provided by GQA naturally fit our triplet-level hallucination evaluation formulation. |
| Dataset Splits | No | The paper describes the construction of the Tri-HE benchmark by selecting 300 images from the GQA dataset and generating questions. It also mentions subsets of 20 or 25 images used for specific evaluations (human judgment, GPT-4V), but it does not provide explicit training, validation, and testing splits for models that might use Tri-HE for training. |
| Hardware Specification | Yes | All experiments are conducted on two Nvidia A100 GPUs. |
| Software Dependencies | Yes | Specifically, we primarily utilize GPT-4 in LLM judge to determine whether a given extracted triplet (v1, e, v2) Gθ... For both knowledge graph extraction and LLM judge, we utilize the gpt-4-1106-preview model via OpenAI's API with default inference parameters. Specifically, we replace GPT-4 with LLaMA-3.3-70B-Instruct (abbrev., Llama-3.3) (Meta AI, 2024b) and re-evaluate all examples listed in Table 3. The first strategy is implemented with a natural language inference (NLI) (Reimers & Gurevych, 2019) model (https://huggingface.co/sentence-transformers/all-mpnet-base-v2). |
| Experiment Setup | Yes | The prompt templates and inference configurations used for LVLMs are detailed in Appendices A.4 and B. All experiments are conducted on two Nvidia A100 GPUs. Specifically, given an extracted triplet, we first calculate its cosine similarity scores with all triplets in the image scene graph G and only retain those ground truth (GT) triplets with similarity scores greater than 0.5... If the NLI score between the extracted triplet and the ground truth triplets is lower than 0.6, the extracted triplet cannot be induced from the GT triplets and is therefore marked as a hallucination. |
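The Experiment Setup row describes a two-stage check: filter GT triplets by cosine similarity (> 0.5), then flag a hallucination when no retained triplet entails the extracted one (NLI score ≥ 0.6). A minimal sketch of that decision logic is below; the function names, vector representations, and the `nli_score` callable are illustrative assumptions, not the paper's actual implementation (which uses sentence-transformers embeddings and an NLI model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_hallucinated(triplet_vec, gt_triplets, nli_score,
                    sim_thresh=0.5, nli_thresh=0.6):
    """Hypothetical sketch of the paper's two-stage triplet check.

    triplet_vec : embedding of the extracted triplet
    gt_triplets : list of (embedding, triplet) pairs from the scene graph
    nli_score   : callable mapping a GT triplet to an NLI entailment score
    """
    # Stage 1: retain GT triplets whose similarity exceeds the threshold.
    retained = [t for vec, t in gt_triplets
                if cosine(triplet_vec, vec) > sim_thresh]
    if not retained:
        return True  # nothing in the scene graph is even similar
    # Stage 2: hallucination if no retained GT triplet entails it.
    return max(nli_score(t) for t in retained) < nli_thresh
```

For example, with an extracted triplet whose only similar GT triplet scores 0.8 on NLI, the check passes; if the best NLI score among retained triplets is 0.3, the triplet is flagged as hallucinated.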