Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. We conduct extensive experiments by merging commonly used VLMs with LLMs trained on diverse math reasoning datasets (4). Our findings demonstrate that model merging consistently improves the reasoning capabilities of VLMs across all math benchmarks, with minimal impact on the perception-dominant tasks. We evaluate the performance on a series of VLM benchmarks. |
| Researcher Affiliation | Academia | 1City University of Hong Kong 2Hong Kong University of Science and Technology 3National University of Singapore 4Northwestern University. |
| Pseudocode | No | The paper describes methods and equations but does not contain any clearly labeled pseudocode or algorithm blocks. For example, equation (1) defines `τ_task = θ_ft − θ_base`, equation (2) defines `τ_vlm = θ_vlm − θ_base`, and equation (3) defines `θ′_vlm = θ_base + λτ_vlm + (1 − λ)τ_reason`, but these are mathematical expressions, not pseudocode. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/shiqichen17/VLM_Merging. |
| Open Datasets | Yes | We evaluate the performance on a series of VLM benchmarks. We apply five benchmarks: MathVista (Lu et al., 2024), MathVerse (Zhang et al., 2025), MathVision (Wang et al., 2024a), DynaMath (Zou et al., 2024) and MMStar (Chen et al., 2024a). |
| Dataset Splits | Yes | We evaluate the performance on a series of VLM benchmarks. We apply five benchmarks: MathVista (Lu et al., 2024), MathVerse (Zhang et al., 2025), MathVision (Wang et al., 2024a), DynaMath (Zou et al., 2024) and MMStar (Chen et al., 2024a). Among these benchmarks, MathVista is a diverse benchmark that includes both math-related reasoning tasks and general visual question answering tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | Hyperparameters: In our main analysis and experimental sections, we employ a linear merging strategy for all task vectors under the same hyperparameter settings to ensure a fair comparison. This approach assigns a weight of 0.9 to the textual component of LLaVA-Next-LLaMA3-8B and 0.1 to the reasoning task vector, i.e., λ = 0.9. This parameter is tuned on MathVista based on Dart-Prop (Tong et al., 2024). We choose the best value from the range (0.8, 0.85, 0.9). |
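The linear merging described by equations (1)–(3) can be sketched as below. This is a minimal illustration over plain parameter dictionaries, not the authors' released implementation (see their repository for that); the function names and the toy tensors are hypothetical, and real checkpoints would use framework state dicts instead of floats.

```python
def task_vector(theta_ft, theta_base):
    """Eq. (1)/(2): tau = theta_ft - theta_base, per parameter."""
    return {k: theta_ft[k] - theta_base[k] for k in theta_base}

def linear_merge(theta_base, tau_vlm, tau_reason, lam=0.9):
    """Eq. (3): theta'_vlm = theta_base + lam*tau_vlm + (1-lam)*tau_reason.

    lam weights the VLM's own (textual) task vector; (1 - lam)
    weights the reasoning task vector from the merged-in LLM.
    """
    return {
        k: theta_base[k] + lam * tau_vlm[k] + (1 - lam) * tau_reason[k]
        for k in theta_base
    }

# Toy example with scalar "parameters" (illustrative values only).
theta_base = {"w": 1.0}
tau_vlm = task_vector({"w": 3.0}, theta_base)       # {"w": 2.0}
tau_reason = task_vector({"w": 0.0}, theta_base)    # {"w": -1.0}
merged = linear_merge(theta_base, tau_vlm, tau_reason, lam=0.9)
# merged["w"] = 1.0 + 0.9*2.0 + 0.1*(-1.0) = 2.7
```

With λ = 0.9, as in the paper's main setup, the merged model stays close to the original VLM weights while mixing in a small fraction of the reasoning task vector.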