Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Clean Slate for Offline Reinforcement Learning
Authors: Matthew T. Jackson, Uljad Berdica, Jarek Liesen, Shimon Whiteson, Jakob Foerster
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches within a single, comprehensive hyperparameter space, enabling algorithm development in a shared hyperparameter space. Using Unifloral with our rigorous evaluation procedure, we develop two novel algorithms, TD3-AWR (model-free) and MoBRAC (model-based), which substantially outperform established baselines. All code for this project can be found in our public codebase. [...] In Figure 3, we evaluate a range of prior algorithms (list in Appendix A). [...] In Figure 6, we show that TD3-AWR's performance curve strictly dominates ReBRAC on 6 out of 9 datasets and is dominated by ReBRAC in only 1. [...] Figure 7 shows how MoBRAC outperforms other model-based methods for all datasets, except for MOPO in maze2d-large-v1. |
| Researcher Affiliation | Academia | Matthew T. Jackson* Uljad Berdica* Jarek Liesen* Shimon Whiteson Jakob N. Foerster University of Oxford EMAIL |
| Pseudocode | No | The paper describes algorithmic components using mathematical notation and descriptive text (e.g., Section I.1 Critic Objective, I.2 Actor Objective, I.3 Dynamics Modelling with equations like (4) through (11)), but it does not include explicitly formatted pseudocode or algorithm blocks as defined by the question. |
| Open Source Code | Yes | All code for this project can be found in our public codebase. [...] To ensure ease of adoption and reproducibility, we release a straightforward software interface for performing this evaluation procedure, thereby empowering future work to evaluate offline RL algorithms robustly and transparently. [...] Yes, the paper features an anonymous repository and is built to optimize for transparent ablations, reproducibility and evaluations. |
| Open Datasets | Yes | Alarmingly, the majority of offline RL methods considered in this work were evaluated only on MuJoCo and Adroit tasks from the D4RL suite [20]. [...] We use standard D4RL [20] datasets available through their API. |
| Dataset Splits | No | The paper uses standard D4RL datasets, which are pre-collected static datasets for offline RL. While this implicitly defines the training data, the paper does not explicitly provide numerical training/test/validation splits (e.g., percentages or sample counts) for these datasets in the main text or appendices. |
| Hardware Specification | Yes | algorithms trained for 1M update steps on HalfCheetah medium expert using a single L40S GPU. |
| Software Dependencies | No | Secondly, we implement our algorithms in end-to-end compiled JAX, leading to major speed-ups against competing implementations (Figure 5). The paper mentions JAX as a core framework but does not provide specific version numbers for JAX or any other software libraries used. |
| Experiment Setup | Yes | A Unified Hyperparameter Space for Offline RL [...] Table 5: Hyperparameters of prior algorithms in Unifloral; light gray values indicate inactive settings. (Table 5 lists specific values and ranges for batch size, learning rates, discount factor, Polyak step size, actor/critic layers and hidden sizes, coefficients for actor/critic objectives, AWR parameters, and more.) |