Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Uni-RL: Unifying Online and Offline RL via Implicit Value Regularization
Authors: Haoran Xu, Liyuan Mao, Hui Jin, Weinan Zhang, Xianyuan Zhan, Amy Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Uni-RL on a range of standard RL benchmarks across online, offline, and offline-to-online settings. In online RL, Uni-RL achieves higher sample efficiency than both off-policy methods without trust-region updates and on-policy methods with trust-region updates. In offline RL, Uni-RL retains the benefits of in-sample learning while outperforming IVR through better policy extraction. In offline-to-online RL, Uni-RL beats both constraint-based methods and unconstrained approaches by effectively balancing stability and adaptability. |
| Researcher Affiliation | Academia | 1University of Texas at Austin 2Shanghai Jiao Tong University 3Nanyang Technological University 4Tsinghua University |
| Pseudocode | Yes | Algorithm 1 Unified Implicit Value Regularization |
| Open Source Code | Yes | Code: https://github.com/ryanxhr/Uni-RL |
| Open Datasets | Yes | We evaluate Uni-RL on 6 widely used RL benchmarks and 23 environments across online, offline, and offline-to-online settings... Online RL: Gym, PyBullet, DMControl... Offline RL: D4RL MuJoCo and Antmaze... Offline-to-online RL: D4RL Antmaze, Kitchen and Adroit. |
| Dataset Splits | Yes | Offline RL. In the offline setting, we evaluate Uni-RL on the D4RL benchmark (Fu et al., 2020) and compare it with several state-of-the-art algorithms... For MuJoCo environments, we have the following datasets. halfcheetah/hopper/walker2d-m (medium): Collected by a policy with moderate performance, typically reaching around one-third of expert returns. These datasets represent structured but suboptimal behavior. halfcheetah/hopper/walker2d-m-r (medium-replay): Contains the replay buffer of the mediocre SAC policy. It includes a wide range of off-policy transitions, many of which are suboptimal or noisy. halfcheetah/hopper/walker2d-m-e (medium-expert): A 50-50 mixture of medium and expert trajectories. These datasets are designed to test whether algorithms can leverage near-optimal data when it is partially present. |
| Hardware Specification | No | This research used the computational cluster resource provided by the Texas Advanced Computing Center at UT Austin. |
| Software Dependencies | No | We implemented Uni-RL using PyTorch and ran it on all datasets... In online experiments, we run baselines using the implementation from ACME (Hoffman et al., 2020)... In offline experiments, baseline results for other methods were directly sourced from their respective papers. In offline-to-online experiments, we run baselines using the PyTorch implementation from CORL (Tarasov et al., 2024b). |
| Experiment Setup | Yes | For the network, we use 3-layer MLP with 256 hidden units and Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1 x 10^-3 for both policy and value functions in all tasks. We also use a target network with soft update weight 5 x 10^-3 for Q-function. We clip the output of the weight function by max(w(x), 0) to ensure a non-negative BC weight. We use 0-1 normalization to the weight in each batch and then clip it to [wmin, wmax] where we set wmin to 0.1 and wmax to 0.9 through all the datasets... We search α over [0.1, 0.5, 1.0, 2.0] and λ over [0.01, 0.05, 0.1, 0.2]. The best value of α and λ for all environments are listed in Table 3. |
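
The weight post-processing, soft target update, and hyperparameter grid quoted in the Experiment Setup row can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code. The function names (`process_weights`, `soft_update`) are hypothetical, the handling of a batch whose weights are all equal is an assumption, and so is the Polyak-averaging convention used for the "soft update weight".

```python
from itertools import product

# Values quoted in the Experiment Setup row.
W_MIN, W_MAX = 0.1, 0.9          # clipping range for the normalized BC weight
TAU = 5e-3                       # soft update weight for the target Q-network
ALPHAS = [0.1, 0.5, 1.0, 2.0]    # search values for alpha
LAMBDAS = [0.01, 0.05, 0.1, 0.2] # search values for lambda


def process_weights(raw, w_min=W_MIN, w_max=W_MAX):
    """Hypothetical helper: clip at zero, 0-1 normalize per batch, clip to [w_min, w_max]."""
    # Step 1: max(w(x), 0) keeps the BC weight non-negative.
    nonneg = [max(w, 0.0) for w in raw]
    # Step 2: 0-1 (min-max) normalization within the batch.
    lo, hi = min(nonneg), max(nonneg)
    span = hi - lo
    # Assumption: if every weight in the batch is equal, map them all to 0
    # before the final clip (the source does not say how this case is handled).
    normalized = [(w - lo) / span if span > 0 else 0.0 for w in nonneg]
    # Step 3: clip the normalized weight to [w_min, w_max].
    return [min(max(w, w_min), w_max) for w in normalized]


def soft_update(target, online, tau=TAU):
    """Hypothetical helper: target <- (1 - tau) * target + tau * online.

    Assumes the usual Polyak-averaging convention; parameters are flat lists.
    """
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


# All (alpha, lambda) pairs covered by the search ranges quoted above.
GRID = list(product(ALPHAS, LAMBDAS))

if __name__ == "__main__":
    print(process_weights([-1.0, 0.0, 2.0, 4.0]))
    print(soft_update([0.0, 1.0], [1.0, 1.0]))
    print(len(GRID), "hyperparameter combinations")
```

With these ranges the grid has 4 × 4 = 16 combinations; the source says the best α and λ for each environment are listed in its Table 3 and does not describe how the search was run.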