Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
Authors: Zhentao He, Can Zhang, Ziheng Wu, Zhenghao Chen, Yufei Zhan, Yifan Li, Zhao Zhang, Xian Wang, Minghui Qiu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Qwen2.5-VL demonstrate that our 7B-parameter model achieves a 28% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA and there is no significant performance drop in standard tasks, highlighting both effectiveness and robustness. |
| Researcher Affiliation | Collaboration | Zhentao He¹, Can Zhang¹, Ziheng Wu¹, Zhenghao Chen¹, Yufei Zhan¹˒², Yifan Li¹˒³, Zhao Zhang¹, Xian Wang¹, Minghui Qiu¹; ¹ByteDance, ²CASIA, ³RUC |
| Pseudocode | Yes | Algorithm 1 Reward Function for OCR Task |
| Open Source Code | No | We regret that we are unable to provide the source code at this time as it is currently undergoing our institution's internal review and clearance process for open access. The code will be released soon. |
| Open Datasets | Yes | Data is available at https://huggingface.co/datasets/bytedance-research/KIE-HVQA. |
| Dataset Splits | Yes | This dataset includes 2,000 annotated training samples and 400 rigorously curated test instances spanning diverse document types, including identity cards, receipts, and invoices. |
| Hardware Specification | No | The paper describes the model (Qwen-2.5-VL-7B-Instruct), the training time (approximately 4 hours for SFT, several hours for GRPO), and the frameworks used (LLaMA-Factory, Easy-R1), but does not specify any hardware, such as GPU models, CPU types, or memory, used for the experiments in Section 5.1 or elsewhere in the main text. |
| Software Dependencies | No | The paper mentions using the LLaMA-Factory framework [42] and Easy-R1 framework [41] but does not provide specific version numbers for these or any other software libraries or programming languages used in the experiments. |
| Experiment Setup | Yes | For the cold-start initialization, we used Qwen-2.5-VL-7B-Instruct as the base model and performed supervised fine-tuning for 5 epochs with a learning rate of 1e-6 and a data rollout batch size of 512. |