Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Authors: Senqiao Yang, Junyi Li, Xin Lai, Jinming Wu, Wei Li, Zejun MA, Bei Yu, Hengshuang Zhao, Jiaya Jia
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. |
| Researcher Affiliation | Collaboration | Senqiao Yang 1,3, Junyi Li 2,3, Xin Lai 3, Jinming Wu 3, Wei Li 3, Bei Yu 1, Hengshuang Zhao 2, Jiaya Jia 1,4; 1 CUHK, 2 HKU, 3 ByteDance, 4 HKUST |
| Pseudocode | No | The paper includes mathematical equations and code snippets for evaluation methods but no clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted explicitly as an algorithm. |
| Open Source Code | Yes | Codes and models: https://github.com/dvlab-research/VisionThink |
| Open Datasets | Yes | We evaluate VisionThink on several general VQA benchmarks, including ChartQA [45], OCRBench [37], MathVista [42], MMVet [91], RealWorldQA [74], and MathVerse [96], etc. ... The MME benchmark [15]... MMMU [93] serves as a benchmark... |
| Dataset Splits | No | To enable our model can decide when high resolution is necessary, we collect corresponding VQA samples, including both cases requiring high-resolution images and cases adequately answered using downsampled images. ...we selected 10K samples that require high-resolution images and 10K samples that do not, to train our model. |
| Hardware Specification | No | The paper mentions that 'Inference Details. In this paper, we use the lmms-eval [94] to evaluate the model's performance. Besides, in order to save the GPU memory and improve the inference speed, we utilize the vLLM [25] framework and set the temperature to zero for inference.' However, it does not specify any particular GPU models, CPUs, or other hardware details used for the experiments. |
| Software Dependencies | No | In this paper, we conduct experiments using Qwen2.5-VL-7B-Instruct [5] as the base model, trained with the veRL framework [58]. We use a total batch size of 512 with mixed-precision (FP16) training. ... For inference, we use the vLLM framework and set the temperature to 0. |
| Experiment Setup | Yes | For training, we employ veRL [58] framework and use a total batch size of 512, with a mini-batch size of 32, we set the policy LLM learning rate to 1e-6 and sample 16 responses per prompt, ensuring a stable and effective training process. For inference, we use the vLLM framework and set the temperature to 0. |
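
The hyperparameters quoted in the Experiment Setup row can be gathered into one record for quick sanity checks. The sketch below is only an illustration: the class and field names are hypothetical and are not veRL or vLLM configuration keys, and it assumes that the total batch size of 512 counts prompts, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; field names are illustrative only."""

    base_model: str = "Qwen2.5-VL-7B-Instruct"
    total_batch_size: int = 512
    mini_batch_size: int = 32
    learning_rate: float = 1e-6
    responses_per_prompt: int = 16
    inference_temperature: float = 0.0

    def mini_batches_per_batch(self) -> int:
        # Number of mini-batches needed to cover one total batch.
        if self.total_batch_size % self.mini_batch_size != 0:
            raise ValueError("total batch size must be a multiple of the mini-batch size")
        return self.total_batch_size // self.mini_batch_size

    def sampled_responses_per_batch(self) -> int:
        # Assumption: the total batch size counts prompts, so each batch
        # yields total_batch_size * responses_per_prompt sampled responses.
        return self.total_batch_size * self.responses_per_prompt


setup = ReportedSetup()
print(setup.mini_batches_per_batch())
print(setup.sampled_responses_per_batch())
```

Under that assumption, one batch splits into 512 / 32 = 16 mini-batches and produces 512 × 16 = 8192 sampled responses. If the batch size instead counts responses, the second figure does not apply.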