Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Authors: Senqiao Yang, Junyi Li, Xin Lai, Jinming Wu, Wei Li, Zejun MA, Bei Yu, Hengshuang Zhao, Jiaya Jia

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method.
Researcher Affiliation | Collaboration | Senqiao Yang (1,3), Junyi Li (2,3), Xin Lai (3), Jinming Wu (3), Wei Li (3), Bei Yu (1), Hengshuang Zhao (2), Jiaya Jia (1,4); (1) CUHK, (2) HKU, (3) ByteDance, (4) HKUST
Pseudocode | No | The paper includes mathematical equations and code snippets for evaluation methods but no clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted explicitly as an algorithm.
Open Source Code | Yes | Codes and models: https://github.com/dvlab-research/VisionThink
Open Datasets | Yes | We evaluate VisionThink on several general VQA benchmarks, including ChartQA [45], OCRBench [37], MathVista [42], MMVet [91], RealWorldQA [74], and MathVerse [96], etc. ... The MME benchmark [15]... MMMU [93] serves as a benchmark...
Dataset Splits | No | To enable our model can decide when high resolution is necessary, we collect corresponding VQA samples, including both cases requiring high-resolution images and cases adequately answered using downsampled images. ...we selected 10K samples that require high-resolution images and 10K samples that do not, to train our model.
Hardware Specification | No | The paper mentions that 'Inference Details. In this paper, we use the lmms-eval [94] to evaluate the model's performance. Besides, in order to save the GPU memory and improve the inference speed, we utilize the vLLM [25] framework and set the temperature to zero for inference.' However, it does not specify any particular GPU models, CPUs, or other hardware details used for the experiments.
Software Dependencies | No | In this paper, we conduct experiments using Qwen2.5-VL-7B-Instruct [5] as the base model, trained with the veRL framework [58]. We use a total batch size of 512 with mixed-precision (FP16) training. ... For inference, we use the vLLM framework and set the temperature to 0.
Experiment Setup | Yes | For training, we employ veRL [58] framework and use a total batch size of 512, with a mini-batch size of 32, we set the policy LLM learning rate to 1e-6 and sample 16 responses per prompt, ensuring a stable and effective training process. For inference, we use the vLLM framework and set the temperature to 0.
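
The hyperparameters quoted in the Experiment Setup row can be collected in one place for anyone attempting a reproduction. The following is a minimal sketch, not the authors' configuration: the class name `ReportedSetup` and its field names are hypothetical, and the two derived quantities rest on assumptions the quoted text does not confirm (that the batch and mini-batch sizes count the same unit, and that the total batch size counts prompts rather than sampled responses).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""

    total_batch_size: int = 512          # "total batch size of 512"
    mini_batch_size: int = 32            # "mini-batch size of 32"
    policy_lr: float = 1e-6              # policy LLM learning rate
    responses_per_prompt: int = 16       # responses sampled per prompt
    inference_temperature: float = 0.0   # temperature set to 0 at inference

    def minibatches_per_batch(self) -> int:
        # Assumption: both sizes count the same unit, so one batch
        # splits evenly into this many mini-batches.
        if self.total_batch_size % self.mini_batch_size != 0:
            raise ValueError("total batch size must be divisible by mini-batch size")
        return self.total_batch_size // self.mini_batch_size

    def responses_per_batch(self) -> int:
        # Assumption: the total batch size counts prompts, so each batch
        # produces total_batch_size * responses_per_prompt responses.
        return self.total_batch_size * self.responses_per_prompt


setup = ReportedSetup()
print(setup.minibatches_per_batch())  # 16
print(setup.responses_per_batch())    # 8192
```

Under those assumptions, each batch is split into 16 mini-batches and generates 8192 sampled responses. If the batch size instead counts responses, the second figure does not apply.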