Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization
Authors: Daniel Palenicek, Florian Vogt, Joe Watson, Jan Peters
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks on the DeepMind Control Suite and MyoSuite benchmarks, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a simple yet robust pathway for improving sample efficiency and scalability in model-free reinforcement learning. |
| Researcher Affiliation | Academia | Daniel Palenicek (1,2), Florian Vogt (3), Joe Watson (4), Jan Peters (1,2,5,6). Affiliations: (1) Technical University of Darmstadt, (2) hessian.AI, (3) University of Freiburg, (4) University of Oxford, (5) German Research Center for AI (DFKI), (6) Robotics Institute Germany (RIG) |
| Pseudocode | No | The paper describes methods and theoretical derivations in prose and includes proofs in the appendix (Appendix A and B), but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | To aid reproducibility, we plan to release the code together with the camera-ready version of the paper. At the current time we do not provide the code, however, we already provide all implementation details in the paper. We plan to release the code together with the publication of the paper. |
| Open Datasets | Yes | To evaluate the effectiveness of our proposed CrossQ + WN method, we conduct a comprehensive set of experiments on the DeepMind Control Suite [41] and MyoSuite [7] benchmarks. |
| Dataset Splits | No | Each experiment is run for 1 million environment steps and across 10 random seeds to ensure statistical robustness. We evaluate agents every 25,000 environment steps for 5 trajectories. |
| Hardware Specification | Yes | All experiments were run on a compute cluster with RTX 3090 and A5000 GPUs, where all 10 seeds run in parallel on a single GPU via jax.vmap. |
| Software Dependencies | No | Our implementation is based on the SAC implementation of jaxrl codebase [25]. |
| Experiment Setup | Yes | Table 1 gives an overview of the hyperparameters that were used for each algorithm that was considered in this work. |
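
The evaluation protocol quoted in the Dataset Splits row (1 million environment steps, evaluation every 25,000 steps for 5 trajectories, 10 seeds) implies a fixed evaluation budget per task. The following is a minimal sketch of that arithmetic, not code from the paper. The helper `evaluation_steps` is a hypothetical name, and the sketch assumes evaluations happen at the end of each interval, so step 0 is excluded and the final step is included; the source does not state this.

```python
# Values quoted in the "Dataset Splits" row of the table above.
TOTAL_ENV_STEPS = 1_000_000
EVAL_INTERVAL = 25_000
EVAL_TRAJECTORIES = 5
NUM_SEEDS = 10


def evaluation_steps(total_steps, interval):
    """Return the environment steps at which an evaluation would occur.

    Assumption (not stated in the source): evaluations happen at the end
    of each interval, so step 0 is excluded and the last step is included.
    """
    return list(range(interval, total_steps + 1, interval))


eval_steps = evaluation_steps(TOTAL_ENV_STEPS, EVAL_INTERVAL)

# Number of evaluation checkpoints in a single training run (one seed).
evals_per_seed = len(eval_steps)

# Evaluation trajectories collected by one seed over the whole run.
trajectories_per_seed = evals_per_seed * EVAL_TRAJECTORIES

# Evaluation trajectories per task and algorithm, summed over all seeds.
trajectories_per_task = trajectories_per_seed * NUM_SEEDS

print(evals_per_seed, trajectories_per_seed, trajectories_per_task)
```

Under these assumptions, each reported learning curve would consist of 40 evaluation points per seed, each averaged over 5 trajectories. If the authors also evaluate at step 0, each count per seed increases by one checkpoint.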