Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

Authors: Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To verify this, during experiments, we first verify the effectiveness of visual thoughts in improving MCoT performance. Then, to extensively explore how visual thoughts work in different expressions, we categorize four major strategies: Natural Language, Structured Language, Edited Image, and Generative Image visual thoughts. Moreover, we also investigate the role of internal attention mechanisms and information flow within LVLMs to analyze the rationale behind visual thoughts. Our findings reveal the following: (1) Removing visual thoughts and forcing reasoning solely from the original image can impair performance, even more than reasoning directly from the query. (2) Different expressions of visual thoughts are more effective in certain scenarios, depending on their expression clarity and efficiency. (3) Visual thoughts not only carry visual information but also serve as primary intermediaries, connecting the input image to deeper transformer layers and enabling more advanced cognitive processing in LVLMs.
Researcher Affiliation	Collaboration	1 School of Computer Science and Engineering, Central South University 2 Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology 3 Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen 4 Text Computing and Cognitive Intelligence Ministry of Education Engineering Research Center, Guizhou University 5 Chinese University of Hong Kong 6 Shanghai AI Laboratory 7 National University of Singapore 8 Peking University 9 ByteDance Seed (China)
Pseudocode	No	The paper includes mathematical formulations in Section 2.1, but no explicit blocks or figures labeled as 'Pseudocode' or 'Algorithm' with structured, step-by-step procedures are found.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will release our code in the official version of the subsequent paper to provide reproduction and provide more help to the future community.
Open Datasets	Yes	Benchmark Settings We select benchmarks from both math and commonsense categories. For the math tasks, we choose IsoBench [10] involving tasks such as chess, math, graph, etc. For the commonsense tasks, we select datasets including MMVP [39], V*Bench [47], M3CoT-Commonsense [4], and CoMT [7], which assess the LVLMs capabilities such as visual grounding and object detection, fine-grained identification, and CoT reasoning.
Dataset Splits	No	The paper mentions using several benchmark datasets like IsoBench, MMVP, V*Bench, M3CoT-Commonsense, and CoMT, but it does not explicitly provide information about the specific training, validation, and test splits used for these datasets within the paper. It does not state percentages, sample counts, or refer to specific standard splits.
Hardware Specification	Yes	In addition, all open source models complete inference on 2 A6000 48G.
Software Dependencies	No	The paper mentions using models like LLaVA-1.5, Qwen2-VL, GPT-4o-mini, and GPT-4o. However, it does not specify any particular software libraries (e.g., PyTorch, TensorFlow) or their version numbers, nor does it list specific programming language versions (e.g., Python 3.x).
Experiment Setup	Yes	Model Settings We conduct all our experiments using four models, including LLaVA-1.5 [21], Qwen2-VL [42], GPT-4o-mini [32], and GPT-4o [32]. For the GPT series models, we adjust the temperature parameter within [0,2]; for the open-source models, we adjust the temperature parameter within the range [0,2].