Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

Authors: En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The experimental section evaluates Perception-R1's performance on visual perception tasks (§5.1), followed by analytical experiments exploring reinforcement learning (RL)'s role in perception policy learning (§5.2). Finally, it discusses the interplay between visual perception and RL, along with key insights for perception policy learning (§5.3). 5.1 Performance Landscape in Perception Tasks: We evaluate Perception-R1 on mainstream perception tasks: visual grounding, counting, OCR, and object detection. Experiments use the datasets described in §4.3 and benchmarks for image understanding. Results are in Tables 1–4.
Researcher Affiliation Collaboration 1 Huazhong University of Science and Technology; 2 Beijing University of Posts and Telecommunications; 3 StepFun; 4 Johns Hopkins University; 5 Tsinghua University
Pseudocode No The paper formulates the optimization objective of GRPO mathematically in Section 3 but does not present it as a structured pseudocode or algorithm block. For example: "The optimization objective of GRPO can be formulated as following: $\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)]}$ ... (1)", which is a mathematical equation.
Open Source Code Yes Project code is available at https://github.com/linkangheng/PR1.
Open Datasets Yes Task and Data Setting. Given that Perception-R1 is primarily oriented towards pure visual and visual-language tasks, we select several mainstream and representative downstream tasks for perception policy learning, specifically including visual grounding, e.g., RefCOCO [71] / + [71] / g [40], OCR, i.e., PageOCR [34], visual counting, i.e., Pixmo-Count [13], and object detection, i.e., COCO2017 [32].
Dataset Splits Yes For each task, a subset (5k–10k) of samples are respectively extracted as base data for individual post-training. More details are in the appendix A.1. Table 7: Training dataset statistics. Notably, we do not mix the data from different perception tasks for joint training because the rewards for different tasks vary.
tasks | datasets | Original | Used | Ratio
visual grounding | RefCOCO / RefCOCO+ / RefCOCOg | 320k | 5k | 1.56%
OCR | PageOCR | 50k | 5k | 10%
visual counting | PixMo-Count | 1.9M | 10k | 0.5%
object detection | COCO2017 | 110k | 110k | 100%
overall | | 2.38M | 130k | -
Hardware Specification Yes Answer: [Yes] Justification: We report it on Section 4.3 and all experiments are conducted on NVIDIA A100 Tensor Core GPU.
Software Dependencies No The paper mentions Qwen2-VL and Qwen2.5-VL as models but does not list specific software libraries or programming languages with their version numbers. For example, it does not state "Python 3.x" or "PyTorch 1.x".
Experiment Setup Yes Training Setting. We focus on the RL-based post-training stage of MLLM. All the selected base models have already undergone pre-training and SFT stage. During RL stage, the initial learning rate is set as 1e-6 with 8 rollouts by default and a batch size of 1. The following are some important hyper-parameters during post-training. Prompts detailed settings are in the appendix A.1.
Gradient Accumulation | Rollout G | KL Coefficient | Max Response Len | Temperature
2 | 8 | 0.04 | 2048 | 1.0
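To make the Experiment Setup values easier to reuse, the following is a minimal Python sketch, not the authors' code. It collects the quoted hyper-parameters in a config object and shows the group-relative reward standardization commonly used with GRPO for the G = 8 rollouts of one prompt. The class name `PostTrainingConfig`, the function `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration. The quoted text elides the full objective, so the paper's exact advantage definition is not confirmed here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class PostTrainingConfig:
    """Hyper-parameters as quoted in the Experiment Setup row (field names are hypothetical)."""
    learning_rate: float = 1e-6
    batch_size: int = 1
    gradient_accumulation: int = 2
    rollouts_per_prompt: int = 8   # "Rollout G"
    kl_coefficient: float = 0.04
    max_response_len: int = 2048
    temperature: float = 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for a single prompt.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), with the
    population standard deviation; eps only guards against a zero std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


cfg = PostTrainingConfig()

# Hypothetical rewards for the 8 rollouts of one prompt (not taken from the paper).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
assert len(rewards) == cfg.rollouts_per_prompt

advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones; when every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no policy-gradient signal.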