Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scaling Offline RL via Efficient and Expressive Shortcut Models
Authors: Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kianté Brantley, Wen Sun
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We present SORL's overall performance across a range of environments in Table 1. Notably, SORL achieves the best performance on 5 out of 8 environments, including substantial improvements over the baselines on antmaze-large and antsoccer-arena. |
| Researcher Affiliation | Academia | Nicolas Espinosa-Dice (Cornell University); Yiyi Zhang (Cornell University); Yiding Chen (Cornell University); Bradley Guo (Cornell University); Owen Oertell (Cornell University); Gokul Swamy (Carnegie Mellon University); Kianté Brantley (Harvard University); Wen Sun (Cornell University) |
| Pseudocode | Yes | Algorithm 1: Scalable Offline Reinforcement Learning (SORL). Data: Offline dataset D. while not converged do: Sample (x, a_1, x′, r) ∼ D, a_0 ∼ N(0, I), (h, t) ∼ p(h, t)  # Parallelize batch; a_t ← (1 − t) a_0 + t a_1  # Noise action |
| Open Source Code | Yes | We release the code at nico-espinosadice.github.io/projects/sorl. Answer: [Yes] Justification: Code is included in the supplementary material. |
| Open Datasets | Yes | We evaluate SORL on locomotion and manipulation robotics tasks in the OGBench task suite [Park et al., 2024a]. Answer: [Yes] Justification: This paper uses the open-source dataset OGBench [Park et al., 2024a]. |
| Dataset Splits | Yes | We follow the standard dataset protocols (navigate for locomotion, play for manipulation) and use OGBench's reward-based singletask variants for all experiments [Park et al., 2024a], which are best suited for reward-maximizing RL. Each OGBench environment offers five unique tasks, each associated with a specific evaluation goal, denoted by suffixes singletask-task1 through -task5. We utilize all five tasks for each environment. |
| Hardware Specification | Yes | The experiments were performed on a Nvidia RTX 3090 GPU. |
| Software Dependencies | No | We use a multi-layer perceptron with 4 hidden layers of size 512 for both the value and policy networks. We apply layer normalization [Ba et al., 2016] to value networks. We use the Adam optimizer [Kingma, 2014], which we add gradient clipping to. |
| Experiment Setup | Yes | We use a multi-layer perceptron with 4 hidden layers of size 512 for both the value and policy networks. We train algorithms for 1,000,000 gradient steps and evaluate 50 episodes every 100,000 gradient steps. Minibatch size: 256; learning rate: 1e-4; gradient clipping norm: 1; discount factor γ: 0.99 (default), 0.995 (antmaze-giant, humanoidmaze, antsoccer); BC coefficient: 10; self-consistency coefficient: 10; Q-loss coefficient: varies by environment (e.g., 500 for antmaze-large, 10 for cube-single-play). |
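
The hyperparameters in the Experiment Setup row and the action-noising step quoted in the Pseudocode row can be collected into a short Python sketch. This is an illustration only and is not taken from the released SORL code: the names `SORLConfig`, `discount_for`, `eval_steps`, and `noise_action` are hypothetical, the prefix matching on environment names is an assumption, and the evaluation schedule assumes evaluation runs at every multiple of the interval (the quoted text does not say whether step 0 is evaluated).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SORLConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    hidden_layers: int = 4
    hidden_size: int = 512
    gradient_steps: int = 1_000_000
    eval_interval: int = 100_000
    eval_episodes: int = 50
    batch_size: int = 256
    learning_rate: float = 1e-4
    grad_clip_norm: float = 1.0
    discount: float = 0.99            # default; see discount_for() for exceptions
    bc_coef: float = 10.0
    self_consistency_coef: float = 10.0
    q_loss_coef: float = 500.0        # varies, e.g. 500 antmaze-large, 10 cube-single-play


def discount_for(env_name: str) -> float:
    """Return 0.995 for the three environment families listed, else 0.99.

    Assumption: environment names start with the family name.
    """
    long_horizon = ("antmaze-giant", "humanoidmaze", "antsoccer")
    return 0.995 if env_name.startswith(long_horizon) else 0.99


def eval_steps(cfg: SORLConfig) -> list:
    """Gradient steps at which evaluation would run under the stated assumption."""
    return list(range(cfg.eval_interval, cfg.gradient_steps + 1, cfg.eval_interval))


def noise_action(a1, t, rng):
    """Noise a dataset action: a_t = (1 - t) * a0 + t * a1, with a0 ~ N(0, I)."""
    a0 = [rng.gauss(0.0, 1.0) for _ in a1]
    return [(1.0 - t) * n + t * a for n, a in zip(a0, a1)]


if __name__ == "__main__":
    cfg = SORLConfig()
    print(len(eval_steps(cfg)), "evaluations of", cfg.eval_episodes, "episodes each")
    print(noise_action([0.5, -0.2], t=0.5, rng=random.Random(0)))
```

At `t = 1` the interpolation returns the dataset action unchanged, and at `t = 0` it returns pure Gaussian noise, which matches the role of the "Noise action" line in Algorithm 1.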