Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Uni-RL: Unifying Online and Offline RL via Implicit Value Regularization

Authors: Haoran Xu, Liyuan Mao, Hui Jin, Weinan Zhang, Xianyuan Zhan, Amy Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate Uni-RL on a range of standard RL benchmarks across online, offline, and offline-to-online settings. In online RL, Uni-RL achieves higher sample efficiency than both off-policy methods without trust-region updates and on-policy methods with trust-region updates. In offline RL, Uni-RL retains the benefits of in-sample learning while outperforming IVR through better policy extraction. In offline-to-online RL, Uni-RL beats both constraint-based methods and unconstrained approaches by effectively balancing stability and adaptability.
Researcher Affiliation	Academia	(1) University of Texas at Austin, (2) Shanghai Jiao Tong University, (3) Nanyang Technological University, (4) Tsinghua University
Pseudocode	Yes	Algorithm 1 Unified Implicit Value Regularization
Open Source Code	Yes	Code: https://github.com/ryanxhr/Uni-RL
Open Datasets	Yes	We evaluate Uni-RL on 6 widely used RL benchmarks and 23 environments across online, offline, and offline-to-online settings... Online RL: Gym, PyBullet, DMControl... Offline RL: D4RL MuJoCo and Antmaze... Offline-to-online RL: D4RL Antmaze, Kitchen and Adroit.
Dataset Splits	Yes	Offline RL. In the offline setting, we evaluate Uni-RL on the D4RL benchmark (Fu et al., 2020) and compare it with several state-of-the-art algorithms... For MuJoCo environments, we have the following datasets. halfcheetah/hopper/walker2d-m (medium): Collected by a policy with moderate performance, typically reaching around one-third of expert returns. These datasets represent structured but suboptimal behavior. halfcheetah/hopper/walker2d-m-r (medium-replay): Contains the replay buffer of the mediocre SAC policy. It includes a wide range of off-policy transitions, many of which are suboptimal or noisy. halfcheetah/hopper/walker2d-m-e (medium-expert): A 50-50 mixture of medium and expert trajectories. These datasets are designed to test whether algorithms can leverage near-optimal data when it is partially present.
Hardware Specification	No	This research used the computational cluster resource provided by the Texas Advanced Computing Center at UT Austin.
Software Dependencies	No	We implemented Uni-RL using PyTorch and ran it on all datasets... In online experiments, we run baselines using the implementation from ACME (Hoffman et al., 2020)... In offline experiments, baseline results for other methods were directly sourced from their respective papers. In offline-to-online experiments, we run baselines using the PyTorch implementation from CORL (Tarasov et al., 2024b).
Experiment Setup	Yes	For the network, we use 3-layer MLP with 256 hidden units and Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1 x 10^-3 for both policy and value functions in all tasks. We also use a target network with soft update weight 5 x 10^-3 for Q-function. We clip the output of the weight function by max(w(x), 0) to ensure a non-negative BC weight. We use 0-1 normalization to the weight in each batch and then clip it to [wmin, wmax] where we set wmin to 0.1 and wmax to 0.9 through all the datasets... We search α over [0.1, 0.5, 1.0, 2.0] and λ over [0.01, 0.05, 0.1, 0.2]. The best values of α and λ for all environments are listed in Table 3.
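
The weight processing, soft target update, and search grid quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is only an illustrative reading of that excerpt, not the authors' implementation. The function names are hypothetical. Two behaviours are assumptions the excerpt does not confirm: how a batch of identical weights is handled, and the Polyak-style update rule.

```python
# Constants quoted in the Experiment Setup row; all names below are hypothetical.
W_MIN, W_MAX = 0.1, 0.9   # clipping range for the normalized BC weight
TAU = 5e-3                # soft update weight for the target Q-network


def process_bc_weights(raw_weights, w_min=W_MIN, w_max=W_MAX):
    """Clip at zero, 0-1 normalize within the batch, then clip to [w_min, w_max]."""
    nonneg = [max(w, 0.0) for w in raw_weights]      # max(w(x), 0)
    lo, hi = min(nonneg), max(nonneg)
    span = hi - lo
    if span == 0.0:
        # Assumption: a constant batch normalizes to 0 (the excerpt does not say).
        normalized = [0.0 for _ in nonneg]
    else:
        normalized = [(w - lo) / span for w in nonneg]
    return [min(max(w, w_min), w_max) for w in normalized]


def soft_update(target_params, online_params, tau=TAU):
    """Assumed Polyak form: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def hyperparameter_grid():
    """All (alpha, lambda) pairs from the two search lists quoted above."""
    alphas = [0.1, 0.5, 1.0, 2.0]
    lambdas = [0.01, 0.05, 0.1, 0.2]
    return [(a, l) for a in alphas for l in lambdas]
```

Under these assumptions, a batch of raw weights such as `[-1.0, 0.0, 1.0, 2.0]` is first clipped to be non-negative, rescaled to the unit interval, and finally confined to [0.1, 0.9]. The grid contains 16 candidate (α, λ) pairs.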