Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Authors: Tianyi Bai, Yuxuan Fan, Qiu Jiantao, Fupeng Sun, Jiayi Song, Junlin Han, Zichen Liu, Conghui He, Wentao Zhang, Binhang Yuan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering.
Researcher Affiliation	Academia	Tianyi Bai1,2, Yuxuan Fan3, Qiu Jiantao2, Fupeng Sun4, Jiayi Song5, Junlin Han6, Zichen Liu1, Conghui He2, Wentao Zhang5,2, Binhang Yuan1. Affiliations: 1 The Hong Kong University of Science and Technology; 2 Shanghai Artificial Intelligence Laboratory; 3 The Hong Kong University of Science and Technology (Guangzhou); 4 Imperial College London; 5 Peking University; 6 Oxford University.
Pseudocode	No	The paper describes the methodology in narrative text and mathematical formulations (Sections 4.1 and 4.3) but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code and datasets are publicly released at https://github.com/Relaxed-System-Lab/hallu_med.
Open Datasets	Yes	Code and datasets are publicly released at https://github.com/Relaxed-System-Lab/hallu_med. ... Using DOCCI [43] and Visual Genome [28], we develop a pipeline with filtering, semantic edit planning, and controlled image editing. ... We construct and release the Micro Edit Dataset (MED) and the Micro Edit Detection benchmark, targeting fine-grained vision-language reasoning.
Dataset Splits	Yes	To prevent contamination, benchmark samples are excluded from the fine-tuning dataset. The final benchmark contains 165 questions evenly distributed across all edit types... The MED-Real Set is created by sampling 50 minimally different image pairs from the MMVP benchmark [54]... This expands the evaluation set to 215 items, combining 165 synthetic edit pairs and 50 real-world pairs, offering a more comprehensive assessment of sensitivity to controlled differences and real-world generalization.
Hardware Specification	Yes	All the training processes were conducted using llamafactory [67]. Regarding image resolution and the number of image tokens, we adhere to the original settings specified by each model. Table 5: Hyperparameters for training Qwen2-VL & Qwen2.5-VL models ... GPU: 8 NVIDIA A800
Software Dependencies	No	All the training processes were conducted using llamafactory [67].
Experiment Setup	Yes	In this section, we present all the hyperparameters we used to train the three kinds of models in Table 5, Table 6 and Table 7. All the training processes were conducted using llamafactory [67]. Regarding image resolution and the number of image tokens, we adhere to the original settings specified by each model. Table 5: Hyperparameters for training Qwen2-VL & Qwen2.5-VL models (Hyperparameter: Value). LoRA Rank: 8; LoRA α: 16; LoRA Dropout: 0.1; LoRA Target: all; GPU: 8 NVIDIA A800; Batch Size: 16; Gradient Accumulation Steps: 8; Warmup Ratio: 0.1; Learning Rate: 1e-4; Learning Rate Scheduler: Cosine; Unfreeze Vision Tower: True
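
For readers who want the Table 5 values quoted above in machine-readable form, the following is a minimal sketch that records them in a Python dataclass. The class name and field names are illustrative assumptions; they are not the option names used by llamafactory or by the authors' released code. The derived LoRA scaling factor assumes the common alpha / rank convention, which the quoted text does not state explicitly. The excerpt also does not say whether the batch size of 16 is per device or global, so no effective batch size is computed here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Table5Hyperparameters:
    """Values quoted from Table 5 (Qwen2-VL and Qwen2.5-VL training).

    Field names are hypothetical labels chosen for this sketch,
    not llamafactory configuration keys.
    """
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    lora_target: str = "all"
    num_gpus: int = 8
    gpu_model: str = "NVIDIA A800"
    batch_size: int = 16
    gradient_accumulation_steps: int = 8
    warmup_ratio: float = 0.1
    learning_rate: float = 1e-4
    lr_scheduler: str = "cosine"
    unfreeze_vision_tower: bool = True


config = Table5Hyperparameters()

# Assumption: LoRA updates are scaled by alpha / rank, as in the usual LoRA formulation.
lora_scaling = config.lora_alpha / config.lora_rank

# Print each recorded setting, one per line.
for name, value in asdict(config).items():
    print(f"{name}: {value}")
```

Keeping the settings in a frozen dataclass makes it harder to change a reported value by accident when reusing them in a reproduction attempt.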