Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Authors: Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To rigorously evaluate ROBOT-R1, we also introduce a new benchmark that demands the diverse embodied reasoning capabilities for the task. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks.
Researcher Affiliation	Collaboration	Dongyoung Kim1,4, Sumin Park1, Huiwon Jang1, Jinwoo Shin1,4; 1KAIST, 2Yonsei University, 3UC Berkeley, 4RLWRLD
Pseudocode	No	The paper describes methods and processes through narrative text and figures, but it does not contain explicit pseudocode blocks or algorithm listings with structured steps.
Open Source Code	Yes	All workspaces related to data generation and learning are open to the public. We utilize open source model for training. Footnotes 1, 2, and 3 point to GitHub repositories: (1) https://github.com/stepjam/ARM, (2) https://github.com/hiyouga/EasyR1, (3) https://github.com/2U1/Qwen2-VL-Finetune.
Open Datasets	Yes	The training data used in our experiments are generated using the built-in data generator in RLBench [53]. We also introduce a new benchmark called the ROBOT-R1 Bench. We consider the EmbodiedBench Manipulation benchmark [19], a vision-driven agent assessment platform built upon the RLBench simulation environment [53]. We evaluate the model's performance on the SpatialRGPT benchmark [20]. We extend the ROBOT-R1 Bench to the ROBOT-R1 Bridge Bench [66]. We further evaluate whether the model trained with ROBOT-R1 generalizes to other robot agent environments beyond RLBench, using the VLABench [67] evaluation pipeline. In LIBERO simulation, the environment is simulated using a Franka Panda Arm [70].
Dataset Splits	No	The training data used in our experiments are generated using the built-in data generator in RLBench [53]. We collect 50 demonstrations per task from the variation 0 settings. Consequently, each task contains approximately 2.5K questions, resulting in a total of around 7.5K QA pairs across the three QA tasks used for training. The ROBOT-R1 Bench dataset consists of 10 tasks from RLBench [53]. For each task, we randomly sample five frames from expert demonstrations, resulting in a total of 50 images. The final dataset consists of 215 open-ended questions in total: 65 for spatial reasoning and 50 for each of the other three reasoning types.
Hardware Specification	Yes	All experiments are conducted on a single node consisting of four A100 80GB GPUs.
Software Dependencies	No	The paper names Qwen2.5-7b-VL-Ins as the base model and refers to the 'EasyR1 workspace' and 'Qwen2-VL-Finetune workspace' for hyperparameters and to the 'GR00T-N1.5 repository' for policy heads, but it does not give version numbers for general software libraries or tools (e.g., Python or PyTorch).
Experiment Setup	Yes	For the training process, we utilize a batch size of 128 over a 5 epoch. During GRPO updates, 5 samples were generated per prompt with a sampling temperature of 1.0. The rollout batch size is set to 512. We use a learning rate of 1.0 × 10^-6 with a weight decay of 1.0 × 10^-2. For the SFT baselines, we use the same batch size, but learning rate is 1.0 × 10^-5.
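
For readers who want to reuse the reported values, the sketch below gathers the hyperparameters quoted in the Experiment Setup row into one Python structure and derives a few quantities from them. It is illustrative only and is not taken from the paper's code: the class, field, and function names are hypothetical, and the derived numbers assume that both the rollout batch size and the update batch size are counted in prompts, which the quoted text does not state. The last helper simply re-adds the ROBOT-R1 Bench question counts quoted in the Dataset Splits row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    batch_size: int = 128           # update batch size (GRPO and SFT baselines)
    epochs: int = 5
    samples_per_prompt: int = 5     # GRPO samples generated per prompt
    temperature: float = 1.0        # sampling temperature during rollouts
    rollout_batch_size: int = 512
    learning_rate: float = 1.0e-6   # GRPO training
    weight_decay: float = 1.0e-2
    sft_learning_rate: float = 1.0e-5


def responses_per_rollout(cfg: ReportedSetup) -> int:
    # Assumption: the rollout batch size counts prompts, so each rollout
    # step produces rollout_batch_size * samples_per_prompt responses.
    return cfg.rollout_batch_size * cfg.samples_per_prompt


def updates_per_rollout(cfg: ReportedSetup) -> int:
    # Assumption: both batch sizes are measured in prompts, so one rollout
    # batch is consumed in rollout_batch_size // batch_size updates.
    return cfg.rollout_batch_size // cfg.batch_size


def bench_question_count(spatial: int = 65, other_types: int = 3,
                         per_other_type: int = 50) -> int:
    # Question counts as quoted in the Dataset Splits row.
    return spatial + other_types * per_other_type


setup = ReportedSetup()
print(responses_per_rollout(setup))  # 2560
print(updates_per_rollout(setup))    # 4
print(bench_question_count())        # 215
```

The frozen dataclass keeps the reported values from being modified by accident; anyone reproducing the experiments should confirm the actual configuration keys against the linked repositories rather than rely on these names.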