Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
Authors: Yujie Zhu, Charles Hepburn, Matthew Thorpe, Giovanni Montana
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our code is available at https://github.com/YujieZhu7/SPReD. Our theoretical analysis establishes several key properties... Through experiments across eight robotic tasks, we demonstrate that SPReD consistently outperforms existing methods, achieving up to 14× success rates with the same interaction steps in complex manipulation tasks like block stacking (0.920 vs. 0.064) and significant improvements even with severely limited or suboptimal demonstrations. |
| Researcher Affiliation | Academia | Yujie Zhu¹, Charles A. Hepburn¹, Matthew Thorpe¹, and Giovanni Montana¹,² ¹Department of Statistics, ²Warwick Manufacturing Group, University of Warwick, CV4 7AL EMAIL |
| Pseudocode | Yes | Algorithm 1 Reinforcement Learning with Smooth Policy regularisation from Demonstrations |
| Open Source Code | Yes | Our code is available at https://github.com/YujieZhu7/SPReD. |
| Open Datasets | Yes | We evaluate SPReD on eight challenging robotics tasks from OpenAI Gym's Fetch and Shadow Dexterous Hand environments [49], simulated in MuJoCo [50]. |
| Dataset Splits | Yes | We use 1000 demonstration episodes for the challenging 3-block stacking task, with 100 demonstrations for all other tasks. See Appendix D and Appendix E for additional details about environments and demonstrations. For each environment, we use 100 demonstration episodes, with the exception of FetchStack3 where we use 1000 episodes due to its greater complexity. |
| Hardware Specification | Yes | All the experiments were performed with a single GeForce GTX 3090 GPU and an Intel Core i9-11900K CPU at 3.50GHz. |
| Software Dependencies | No | The paper mentions using the Adam optimizer [60] and Welford’s online algorithm [61] but does not provide specific version numbers for these or other major software libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We implement our approach using hyperparameters consistent with prior work [6]: mini-batch sizes N_R = 1024 and N_D = 128 for experience and demonstration buffers respectively, discount factor γ = 0.98, and loss weights λ1 = 10⁻³ and λ2 = 1/128. Both actor and critic networks employ identical architectures consisting of two hidden layers with 256 neurons each and ReLU activations. The actor's output layer uses a tanh activation to bound actions within the environment's range. We use the Adam optimizer [60] with learning rate 10⁻³ for all networks. For SPReD-E, the scaling constant α was set to 10 based on preliminary experiments. |
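
The Experiment Setup row can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class name `SPReDConfig`, its field names, and the helper functions are hypothetical. It also assumes that the loss weights and learning rate, whose exponents and fraction bars appear to have been lost in extraction, read as 10⁻³, 1/128, and 10⁻³. The observation and action dimensions in the usage lines are placeholders, since the table does not state them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SPReDConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    batch_size_replay: int = 1024             # N_R, experience buffer mini-batch
    batch_size_demo: int = 128                # N_D, demonstration buffer mini-batch
    gamma: float = 0.98                       # discount factor
    lambda1: float = 1e-3                     # loss weight (assumed reading: 10^-3)
    lambda2: float = 1 / 128                  # loss weight (assumed reading: 1/128)
    hidden_sizes: Tuple[int, ...] = (256, 256)  # two hidden layers, ReLU activations
    learning_rate: float = 1e-3               # Adam step size (assumed reading: 10^-3)
    alpha: float = 10.0                       # scaling constant, SPReD-E only


def mlp_layer_shapes(in_dim: int, out_dim: int,
                     hidden_sizes: Tuple[int, ...]) -> List[Tuple[int, int]]:
    """Return (fan_in, fan_out) for each linear layer of a plain MLP."""
    dims = [in_dim, *hidden_sizes, out_dim]
    return list(zip(dims[:-1], dims[1:]))


def mlp_param_count(in_dim: int, out_dim: int,
                    hidden_sizes: Tuple[int, ...]) -> int:
    """Count weights plus biases across all linear layers."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in mlp_layer_shapes(in_dim, out_dim, hidden_sizes))


cfg = SPReDConfig()
# Placeholder dimensions: a 10-dimensional input and a 4-dimensional action.
actor_shapes = mlp_layer_shapes(10, 4, cfg.hidden_sizes)
actor_params = mlp_param_count(10, 4, cfg.hidden_sizes)
```

Freezing the dataclass keeps the values from being changed by accident during a run. The table reports no software versions, so the sketch deliberately avoids committing to a deep learning framework.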