Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Progress Reward Model for Reinforcement Learning via Large Language Models

Authors: Xiuhui Zhang, Ning Gao, Xingyu Jiang, Yihui Chen, Yuheng Pan, Mohan Zhang, Yue Deng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on robotics control tasks demonstrate that our approach outperforms both LLM-based planning and reward methods, achieving state-of-the-art performance. The code is available at https://github.com/deng-ai-lab/PRM4RL
Researcher Affiliation | Academia | Beihang University, 37 Xueyuan Road, Haidian District, Beijing
Pseudocode | Yes | Algorithm 1 PRM4RL
    Require: pythonic LLM prompt prompt, sentence encoder E, policy π_θ, MDP M = ⟨S, A, T, R, γ, ρ_0⟩, RL optimization algorithm Optimize_Algo
    // Request LLM
    plan_list, Ψ(s), Φ(s) = LLM(prompt)
    // Augment State Space
    P: prior = E(plan_list[Ψ(s)])
    // Augment Reward Function
    R_PRM = γ Φ(s_{t+1}) − Φ(s_t) + I(s_{t+1}) r_bonus
    // Optimizing policy under augmented MDP
    M' = ⟨S + P, A, T, R_PRM, γ, ρ_0⟩
    θ = Optimize_Algo(M')
Open Source Code | Yes | The code is available at https://github.com/deng-ai-lab/PRM4RL
Open Datasets | Yes | The evaluation results in Metaworld[27] and Maniskill[28], as shown in Figure 1d, demonstrate the effectiveness of our proposed framework, with significant improvements in success rate and reduced LLM calls, outperforming existing methods in both performance and efficiency. Environments: Meta World[27] is an open-source simulated benchmark that features a Sawyer robot interacting with a tabletop setup that includes a drawer, window, ball, faucet, door, and many other objects. Maniskill[28] is an advanced robotics simulation platform designed for high-fidelity manipulation tasks.
Dataset Splits | No | The paper mentions running experiments with "5 random seeds" and evaluating on "unseen tasks" for generalization, but it does not specify explicit training/test/validation splits for the datasets in terms of percentages, absolute sample counts, or predefined split files. For instance, it doesn't state how the data within Metaworld or Maniskill tasks are divided for training and evaluation during a single run, beyond the use of random seeds for experimental variability.
Hardware Specification | Yes | We utilize a server with 4 NVIDIA GeForce RTX 3090 graphics cards, 128-core CPUs, and 256 GiB of memory for RL training.
Software Dependencies | Yes | The framework is implemented based on the Stable-Baselines3 framework[45], ensuring consistency across all methods. We employ the same RL algorithm with identical training parameters for all methods. For all LLM-related requirements, we utilize OpenAI's GPT-4o[46]. Details of the hyperparameters are provided in Appendix A.3. To ensure robustness and reliability, all experiments are conducted with five different random seeds. We employed the open-source PPO[44] and SAC[43] implementations from Stable-Baselines3[45] (stable-baselines3 v2.6.0, MIT License, code available at https://github.com/DLR-RM/stable-baselines3), and list the hyper-parameters in Table A.3 and A.3.
Experiment Setup | Yes | Details of the hyperparameters are provided in Appendix A.3. Table 1: Hyperparameters of the SAC algorithm applied to each task. Table 2: Hyperparameters of the PPO algorithm applied to each task.
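
The reward and state augmentation quoted in the Pseudocode row can be illustrated with a minimal Python sketch. It is not taken from the PRM4RL repository. It assumes the reward term has the potential-based form γ Φ(s_{t+1}) − Φ(s_t) plus an indicator-gated bonus, and it treats I as a task-success indicator, which is an assumption. The names ProgressReward, augment_state, stage_fn, and encoder are hypothetical; in the paper, Φ, Ψ, and the plan list come from an LLM and the encoder is a sentence encoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ProgressReward:
    """Shaped reward: gamma * phi(s') - phi(s) + I(s') * r_bonus.

    phi and is_success are hypothetical stand-ins for the LLM-generated
    progress function and the indicator in the quoted algorithm.
    """
    phi: Callable[[Sequence[float]], float]         # progress estimate
    is_success: Callable[[Sequence[float]], bool]   # indicator (assumed: task success)
    gamma: float = 0.99
    r_bonus: float = 1.0

    def __call__(self, state: Sequence[float], next_state: Sequence[float]) -> float:
        # Potential-based shaping term computed from the progress estimate.
        shaping = self.gamma * self.phi(next_state) - self.phi(state)
        # Bonus is paid only when the indicator fires on the next state.
        bonus = self.r_bonus if self.is_success(next_state) else 0.0
        return shaping + bonus


def augment_state(
    state: Sequence[float],
    plan_list: Sequence[str],
    stage_fn: Callable[[Sequence[float]], int],
    encoder: Callable[[str], Sequence[float]],
) -> List[float]:
    """Append the encoded current plan step to the raw state (S + P)."""
    prior = encoder(plan_list[stage_fn(state)])
    return list(state) + list(prior)


# Toy 1-D reaching task: progress is the negative distance to x = 1.0.
shaper = ProgressReward(
    phi=lambda s: -abs(s[0] - 1.0),
    is_success=lambda s: abs(s[0] - 1.0) < 0.05,
    gamma=0.9,
    r_bonus=10.0,
)
step_reward = shaper([0.0], [0.5])   # moving toward the goal, no bonus yet
final_reward = shaper([0.5], [1.0])  # reaching the goal adds r_bonus
```

In this toy setup the shaped reward is positive for steps that increase the progress estimate, and the bonus is added only on the transition that satisfies the indicator. A wrapper around a simulator environment would call both helpers on every step before handing the transition to the RL algorithm.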