Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Pixel Reasoner: Incentivizing Pixel Space Reasoning via Curiosity-Driven Reinforcement Learning
Authors: Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our 7B model, Pixel-Reasoner, achieves 84% on V* bench, 74% on Tally QA-Complex, and 84% on Infographics VQA, marking the highest accuracy achieved by any open-source model to date. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. We evaluated our model and other baselines on four representative multimodal benchmarks using greedy decoding: Tally QA, V*, Infographic VQA, and MVBench. |
| Researcher Affiliation | Academia | Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen. University of Waterloo, HKUST, USTC, Vector Institute. Project Page: https://tiger-ai-lab.github.io/Pixel-Reasoner/ Corresponding to: EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods in narrative text and figures (e.g., Figure 1: Illustration of Pixel Reasoner, Figure 2: The Learning Trap) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We include training details in the appendix, and will release code, models, and data to support reproducibility. (NeurIPS Paper Checklist, Question 5: Answer: [Yes] Justification: We will make code, data and models public.) |
| Open Datasets | Yes | Therefore, we first select three datasets: SA1B [Kirillov et al., 2023], Fine Web [Ma et al., 2024] and STARQA [Wu et al., 2024]. During RL, we construct 15,000 queries from our SFT dataset, Infographic VQA [Mathew et al., 2021], and publicly available datasets [Xu et al., 2025, Wu et al., 2024]. |
| Dataset Splits | No | Utilizing the data curation pipeline outlined in Section 3, we assembled a dataset of 7,500 trajectories for warm-start instruction tuning. This dataset includes 5,500 pixel-space reasoning trajectories synthesized using GPT-4o, spanning domains such as images, webpages, and videos. We also include 2,000 text-space reasoning trajectories to balance the use of visual operations. During RL, we construct 15,000 queries from our SFT dataset, Infographic VQA [Mathew et al., 2021], and publicly available datasets [Xu et al., 2025, Wu et al., 2024]. We evaluated our model and other baselines on four representative multimodal benchmarks... The paper describes the size and composition of the training data and lists the evaluation benchmarks, but does not provide explicit training/validation/test splits for the custom-assembled datasets or how these were partitioned for internal model development/tuning. |
| Hardware Specification | Yes | Pixel-Reasoner was trained on 8 A800 (80G) GPUs, using Open-R1 and OpenRLHF for instruction tuning and reinforcement learning respectively. Our 7B model is trained on 4×8 sets of A800 (80G) for 20 hours. |
| Software Dependencies | No | Pixel-Reasoner was trained on 8 A800 (80G) GPUs, using Open-R1 and OpenRLHF for instruction tuning and reinforcement learning respectively. We adopt GRPO [DeepSeek-AI et al., 2025] with selective sample relay due to vanishing advantages [Wang et al., 2025]. |
| Experiment Setup | Yes | For Instruction Tuning, we use a batch size of 128. The learning rate is 1e-6 with 10% warm up steps. For RL, we employ a cosine learning rate schedule with initial learning rate 1e-6 and 3% warm up iterations. During RL training, we sample 8 trajectories per training query and set hyperparameters to α = 0.5, β = 0.05, H = 0.3, and N = 1. |
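
The RL settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the dictionary keys, the function name `lr_at_step`, the total step count, and the decay floor of zero are all assumptions, and the quoted "initial learning rate" is treated here as the peak reached at the end of a linear warmup.

```python
import math

# Values quoted in the table; the key names are hypothetical.
RL_CONFIG = {
    "peak_lr": 1e-6,            # "initial learning rate 1e-6"
    "warmup_frac": 0.03,        # "3% warm up iterations"
    "rollouts_per_query": 8,    # "8 trajectories per training query"
    "alpha": 0.5,
    "beta": 0.05,
    "H": 0.3,
    "N": 1,
}


def lr_at_step(step, total_steps, peak_lr=1e-6, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed 0)."""
    warmup_steps = max(1, int(round(total_steps * warmup_frac)))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # assumed; the source does not state the number of RL steps
    for s in (0, 29, 30, 500, 1000):
        lr = lr_at_step(s, total, RL_CONFIG["peak_lr"], RL_CONFIG["warmup_frac"])
        print(s, lr)
```

The same function covers the instruction-tuning setting by passing `warmup_frac=0.10`, although the source does not say which decay schedule that stage uses.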