Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
Authors: Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. |
| Researcher Affiliation | Academia | Junjie Wu, Tsz Ting Chung, Kai Chen (EMAIL), The Hong Kong University of Science and Technology; Dit-Yan Yeung (EMAIL), The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes the methods for knowledge graph extraction, hallucination judgment, and mitigation strategies using natural language and examples of prompts, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE. |
| Open Datasets | Yes | Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE. The construction of Tri-HE begins with images from the GQA dataset (Hudson & Manning, 2019), as the scene graph annotations provided by GQA naturally fit our triplet-level hallucination evaluation formulation. |
| Dataset Splits | No | The paper describes the construction of the Tri-HE benchmark by selecting 300 images from the GQA dataset and generating questions. It also mentions subsets of 20 or 25 images used for specific evaluations (human judgment, GPT-4V), but it does not provide explicit training, validation, and testing splits for models that might use Tri-HE for training. |
| Hardware Specification | Yes | All experiments are conducted on two Nvidia A100 GPUs. |
| Software Dependencies | Yes | Specifically, we primarily utilize GPT-4 in LLM judge to determine whether a given extracted triplet (v1, e, v2) Gθ... For both knowledge graph extraction and LLM judge, we utilize the gpt-4-1106-preview model via OpenAI's API with default inference parameters. Specifically, we replace GPT-4 with LLaMA-3.3-70B-Instruct (abbrev., Llama-3.3) (Meta AI, 2024b) and re-evaluate all examples listed in Table 3. The first strategy is implemented with a natural language inference (NLI) (Reimers & Gurevych, 2019) model (https://huggingface.co/sentence-transformers/all-mpnet-base-v2). |
| Experiment Setup | Yes | The prompt templates and inference configurations used for LVLMs are detailed in Appendices A.4 and B. All experiments are conducted on two Nvidia A100 GPUs. Specifically, given an extracted triplet, we first calculate its cosine similarity scores with all triplets in the image scene graph G and only retain those ground truth (GT) triplets with similarity scores greater than 0.5... If the NLI score between the extracted triplet and the ground truth triplets is lower than 0.6, the extracted triplet cannot be induced from the GT triplets and is therefore marked as a hallucination. |
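The Experiment Setup row describes a two-stage check: filter GT triplets by cosine similarity (> 0.5), then flag a hallucination when no retained triplet entails the extracted one (NLI score ≥ 0.6). A minimal sketch of that decision logic is below; the function names, vector representations, and the `nli_score` callable are illustrative assumptions, not the paper's actual implementation (which uses sentence-transformers embeddings and an NLI model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_hallucinated(triplet_vec, gt_triplets, nli_score,
                    sim_thresh=0.5, nli_thresh=0.6):
    """Hypothetical sketch of the paper's two-stage triplet check.

    triplet_vec : embedding of the extracted triplet
    gt_triplets : list of (embedding, triplet) pairs from the scene graph
    nli_score   : callable mapping a GT triplet to an NLI entailment score
    """
    # Stage 1: retain GT triplets whose similarity exceeds the threshold.
    retained = [t for vec, t in gt_triplets
                if cosine(triplet_vec, vec) > sim_thresh]
    if not retained:
        return True  # nothing in the scene graph is even similar
    # Stage 2: hallucination if no retained GT triplet entails it.
    return max(nli_score(t) for t in retained) < nli_thresh
```

For example, with an extracted triplet whose only similar GT triplet scores 0.8 on NLI, the check passes; if the best NLI score among retained triplets is 0.3, the triplet is flagged as hallucinated.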