Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization

Authors: Daniel Palenicek, Florian Vogt, Joe Watson, Jan Peters

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks on the DeepMind Control Suite and MyoSuite benchmarks, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a simple yet robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.
Researcher Affiliation	Academia	Daniel Palenicek (1, 2), Florian Vogt (3), Joe Watson (4), Jan Peters (1, 2, 5, 6). Affiliations: (1) Technical University of Darmstadt; (2) hessian.AI; (3) University of Freiburg; (4) University of Oxford; (5) German Research Center for AI (DFKI); (6) Robotics Institute Germany (RIG)
Pseudocode	No	The paper describes methods and theoretical derivations in prose and includes proofs in the appendix (Appendix A and B), but does not contain any structured pseudocode or algorithm blocks.
Open Source Code	No	To aid reproducibility, we plan to release the code together with the camera-ready version of the paper. At the current time we do not provide the code, however, we already provide all implementation details in the paper. We plan to release the code together with the publication of the paper.
Open Datasets	Yes	To evaluate the effectiveness of our proposed CrossQ + WN method, we conduct a comprehensive set of experiments on the DeepMind Control Suite [41] and MyoSuite [7] benchmarks.
Dataset Splits	No	Each experiment is run for 1 million environment steps and across 10 random seeds to ensure statistical robustness. We evaluate agents every 25,000 environment steps for 5 trajectories.
Hardware Specification	Yes	All experiments were run on a compute cluster with RTX 3090 and A5000 GPUs, where all 10 seeds run in parallel on a single GPU via jax.vmap.
Software Dependencies	No	Our implementation is based on the SAC implementation of jaxrl codebase [25].
Experiment Setup	Yes	Table 1 gives an overview of the hyperparameters that were used for each algorithm that was considered in this work.
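
The Hardware Specification row mentions that all 10 seeds run in parallel on a single GPU via jax.vmap. The following is a minimal sketch of that general pattern, not the authors' implementation: `init_params` and `train_step` are hypothetical stand-ins (a single linear layer and one gradient step on a toy loss), and only the seed count of 10 is taken from the table above.

```python
import jax
import jax.numpy as jnp

NUM_SEEDS = 10  # from the table: 10 random seeds per experiment


def init_params(key):
    # Hypothetical stand-in for agent initialization: one linear layer.
    return {"w": jax.random.normal(key, (4, 2)), "b": jnp.zeros((2,))}


def train_step(params, key):
    # Hypothetical stand-in for one update: a gradient step on a toy
    # squared-output loss over a random batch.
    x = jax.random.normal(key, (8, 4))

    def loss_fn(p):
        return jnp.mean((x @ p["w"] + p["b"]) ** 2)

    loss, grads = jax.value_and_grad(loss_fn)(params)
    new_params = jax.tree_util.tree_map(
        lambda p, g: p - 0.1 * g, params, grads
    )
    return new_params, loss


# One PRNG key per seed, stacked along a leading axis of size NUM_SEEDS.
init_keys = jax.random.split(jax.random.PRNGKey(0), NUM_SEEDS)
step_keys = jax.random.split(jax.random.PRNGKey(1), NUM_SEEDS)

# vmap maps the single-seed functions over that leading axis, so every
# parameter leaf gains a seed dimension and all seeds share one device.
params = jax.vmap(init_params)(init_keys)
batched_step = jax.jit(jax.vmap(train_step))
params, losses = batched_step(params, step_keys)
```

Writing the update for a single seed and batching it afterwards keeps the per-seed logic unchanged; whether this fits on one GPU depends on the model and batch sizes, which this sketch does not attempt to reproduce.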