Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Authors: Zhongxing Xu, Chengzhi Liu, Qingyue Wei, Juncheng Wu, James Y Zou, Xin Wang, Yuyin Zhou, Sheng Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, enabling evaluation of whether the model preserves visual grounding while reasoning. We also release RH-Bench, a diagnostic benchmark covering diverse multimodal tasks, designed to jointly assess the balance of reasoning ability and hallucination. We find that (i) larger models generally exhibit a better balance between reasoning and perception; (ii) reasoning and perception balance depends more on the types and domains of the training data than its volume.
Researcher Affiliation Academia 1 Stanford University; 2 UC Santa Barbara; 3 UC Santa Cruz
Pseudocode No Specifically, we extract steering directions from the post-attention hidden states by computing the difference of latent states between long and short reasoning trajectories. These direction vectors are obtained and applied across all layers of the text decoder, with a scaling factor controlling both the magnitude of guidance on the reasoning length. Specifically, we collect responses from the test benchmark and categorize them into long reasoning traces $R_{\text{long}}$ and short reasoning traces $R_{\text{short}}$ based on token length. The query and reasoning steps for each sample are input into the model, from which hidden representations $S^\ell$ are extracted at each layer. $S^\ell(q, t)$ denotes the hidden representation at layer $\ell$ for token position $t$ in the response to query $q$. We compute the average hidden representation over reasoning tokens, where $H_i$ represents the set of token positions within the reasoning span. The average representation is then calculated across the long and short reasoning traces to obtain layerwise embeddings: $S^\ell_{\text{long}} = \frac{1}{|R_{\text{long}}|} \sum_{t \in H_i} S^\ell(q, t), \quad S^\ell_{\text{short}} = \frac{1}{|R_{\text{short}}|} \sum_{t \in H_i} S^\ell(q, t)$ (1) The reasoning length direction at layer $\ell$ is defined as the difference between the long and short embeddings, denoted as $d^\ell$, which captures the variation in the model's representation resulting from different reasoning chain lengths. To adjust the hidden representation based on this direction, we introduce a parameter $\alpha \in [-0.15, 0.15]$ to dynamically control the reasoning length and its magnitude. As $\alpha$ increases, the length of the reasoning chain extends, as shown below: $d^\ell = S^\ell_{\text{long}} - S^\ell_{\text{short}}, \quad S^\ell_{\text{steering}} = S^\ell + \alpha d^\ell$. (2)
Open Source Code Yes https://mlrm-halu.github.io/. Work was partially done while ZX was visiting Stanford. ... As mentioned in the Abstract, our code and data will be made publicly available. ... We also release RH-Bench, a diagnostic benchmark covering diverse multimodal tasks
Open Datasets Yes To systematically study this phenomenon, we introduce RH-AUC, a metric ... We also release RH-Bench, a diagnostic benchmark covering diverse multimodal tasks ... Benchmark Overview. RH-Bench consists of two types of tasks: reasoning and perception, with each task including two types of questions: multiple-choice and open-ended. The reasoning task includes 500 samples sourced from MathVision [44], MathVista [27], MMMU [55], and ScienceQA [28], while the visual perception task includes 500 samples from MMhalu, MMVP, HallusionBench, and VMCBench.
Dataset Splits No RH-Bench consists of two types of tasks: reasoning and perception, with each task including two types of questions: multiple-choice and open-ended. The reasoning task includes 500 samples sourced from MathVision [44], MathVista [27], MMMU [55], and ScienceQA [28], while the visual perception task includes 500 samples from MMhalu, MMVP, HallusionBench, and VMCBench. Both task types use accuracy as the evaluation metric. (This describes the benchmark composition, not how experimental data is split for training/testing the models under study.)
Hardware Specification No The paper does not explicitly describe the hardware used for its experiments.
Software Dependencies No For open-ended questions, both tasks are evaluated using GPT-4o. ... All models are post-trained on Qwen2.5-VL-3B or Qwen2.5-VL-7B, which are used as baseline models. (No specific versions for general software dependencies are provided.)
Experiment Setup Yes To systematically control the reasoning length in reasoning models, we adopt three strategies: (1) Token Budget Forcing: A hard constraint on reasoning length is enforced by predefining a generation budget at decoding time, directly limiting the number of tokens allocated for the reasoning. (2) Test Time Scaling: Reasoning is incrementally extended during inference through staged generation. The model first produces partial reasoning under a 4096-token constraint and halts midway. It is then prompted to continue by appending a simple token ('Wait'), enabling soft extension of reasoning while preserving contextual coherence. (3) Latent State Steering: ... We introduce a parameter $\alpha \in [-0.15, 0.15]$ to dynamically control the reasoning length and its magnitude. ... The RH-AUC is then computed using the trapezoidal rule as: $\text{RH-AUC} = \sum_{i=1}^{n-1} \frac{R_T^{(i+1)} - R_T^{(i)}}{2} \left( H_T^{(i+1)} + H_T^{(i)} \right)$, (3) where $n$ is the number of evaluated reasoning lengths.
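
Illustrative sketch (not from the paper): the Latent State Steering rows above describe a per-layer direction formed as the difference between averaged hidden states of long and short reasoning traces, which is then added to a hidden state with scale α. The pure-Python example below shows that arithmetic for a single layer. The function names, the use of plain lists as vectors, and the assumption that each trace has already been averaged over its reasoning token positions are all hypothetical choices made for illustration; the authors' implementation may differ.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def steering_direction(
    long_states: Sequence[Sequence[float]],
    short_states: Sequence[Sequence[float]],
) -> Vector:
    """Direction for one layer: mean(long traces) minus mean(short traces).

    Assumption: each inner vector is one trace's hidden state at this layer,
    already averaged over the trace's reasoning token positions.
    """
    s_long = mean_vector(long_states)
    s_short = mean_vector(short_states)
    if len(s_long) != len(s_short):
        raise ValueError("long and short states must share a dimension")
    return [a - b for a, b in zip(s_long, s_short)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift one hidden state along the direction: hidden + alpha * direction."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must share a dimension")
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

With this sketch, a positive `alpha` moves the hidden state toward the long-trace average and a negative `alpha` toward the short-trace average, matching the reported range of α from -0.15 to 0.15. In a real model the shift would be applied inside every decoder layer during generation, which this example does not attempt.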
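
Illustrative sketch (not from the paper): the Experiment Setup row states that RH-AUC is computed with the trapezoidal rule over n evaluated reasoning lengths. The example below applies that rule to two equally long sequences, one differenced (the R values) and one averaged pairwise (the H values). The function name `rh_auc`, the argument names, and the choice to take points in the order given without sorting or normalization are assumptions made here; the paper should be consulted for how the R and H values are obtained and scaled.

```python
from typing import Sequence


def rh_auc(r_values: Sequence[float], h_values: Sequence[float]) -> float:
    """Trapezoidal area: sum over i of (R[i+1] - R[i]) / 2 * (H[i+1] + H[i]).

    Assumption: r_values[i] and h_values[i] are the two scores measured at the
    i-th evaluated reasoning length, supplied in evaluation order.
    """
    if len(r_values) != len(h_values):
        raise ValueError("r_values and h_values must have the same length")
    if len(r_values) < 2:
        raise ValueError("need at least two evaluated reasoning lengths")
    area = 0.0
    for i in range(len(r_values) - 1):
        width = r_values[i + 1] - r_values[i]
        area += width / 2.0 * (h_values[i + 1] + h_values[i])
    return area
```

For example, if H stays at 1.0 while R runs from 0.0 to 1.0, the area is 1.0; if H falls linearly from 1.0 to 0.0 over the same range, the area is 0.5. A larger value therefore corresponds to H being retained as R changes.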