Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scaling Offline RL via Efficient and Expressive Shortcut Models

Authors: Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kianté Brantley, Wen Sun

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We present SORL's overall performance across a range of environments in Table 1. Notably, SORL achieves the best performance on 5 out of 8 environments, including substantial improvements over the baselines on antmaze-large and antsoccer-arena.
Researcher Affiliation	Academia	Nicolas Espinosa-Dice, Cornell University; Yiyi Zhang, Cornell University; Yiding Chen, Cornell University; Bradley Guo, Cornell University; Owen Oertell, Cornell University; Gokul Swamy, Carnegie Mellon University; Kianté Brantley, Harvard University; Wen Sun, Cornell University
Pseudocode	Yes	Algorithm 1: Scalable Offline Reinforcement Learning (SORL). Data: offline dataset D. while not converged do: sample (x, a_1, x′, r) ∼ D, a_0 ∼ N(0, I), (h, t) ∼ p(h, t) # parallelize over batch; a_t ← (1 − t)·a_0 + t·a_1 # noise action
Open Source Code	Yes	We release the code at nico-espinosadice.github.io/projects/sorl. Answer: [Yes] Justification: Code is included in the supplementary material.
Open Datasets	Yes	We evaluate SORL on locomotion and manipulation robotics tasks in the OGBench task suite [Park et al., 2024a]. Answer: [Yes] Justification: This paper uses the open-source dataset OGBench [Park et al., 2024a].
Dataset Splits	Yes	We follow the standard dataset protocols (navigate for locomotion, play for manipulation) and use OGBench's reward-based singletask variants for all experiments [Park et al., 2024a], which are best suited for reward-maximizing RL. Each OGBench environment offers five unique tasks, each associated with a specific evaluation goal, denoted by the suffixes singletask-task1 through singletask-task5. We use all five tasks for each environment.
Hardware Specification	Yes	The experiments were performed on an NVIDIA RTX 3090 GPU.
Software Dependencies	No	We use a multi-layer perceptron with 4 hidden layers of size 512 for both the value and policy networks. We apply layer normalization [Ba et al., 2016] to the value networks. We use the Adam optimizer [Kingma, 2014], to which we add gradient clipping.
Experiment Setup	Yes	We use a multi-layer perceptron with 4 hidden layers of size 512 for both the value and policy networks. We train algorithms for 1,000,000 gradient steps and evaluate 50 episodes every 100,000 gradient steps. Hyperparameters: minibatch size 256; learning rate 1e-4; gradient-clipping norm 1; discount factor γ = 0.99 by default and 0.995 for antmaze-giant, humanoidmaze, and antsoccer; BC coefficient 10; self-consistency coefficient 10; and an environment-dependent Q-loss coefficient (e.g., 500 for antmaze-large, 10 for cube-single-play).
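
The noising step quoted in the Pseudocode row (Algorithm 1) can be illustrated with a minimal pure-Python sketch. This is not the authors' code: `sample_h_t` and `noise_action` are hypothetical helpers, and the distribution p(h, t) used here (dyadic step sizes h, with t on a grid of multiples of h, as in shortcut-model training) is an assumption, since the excerpt does not define p(h, t). The velocity target a_1 − a_0 is the standard flow-matching target for the straight-line path a_t = (1 − t)·a_0 + t·a_1.

```python
import random


def sample_h_t(max_log2=7, rng=random):
    """Sample a shortcut step size h and a time t (assumed distribution)."""
    # Assumption: h is drawn uniformly from the dyadic set {1, 1/2, ..., 1/2**max_log2}.
    k = rng.randint(0, max_log2)
    h = 1.0 / (2 ** k)
    # Assumption: t lies on the grid {0, h, 2h, ..., 1 - h}.
    t = rng.randrange(2 ** k) * h
    return h, t


def noise_action(a1, t, rng=random):
    """Return (a0, a_t, velocity target) for a dataset action a1 at time t."""
    # a0 ~ N(0, I): one independent standard normal draw per action dimension.
    a0 = [rng.gauss(0.0, 1.0) for _ in a1]
    # Linear interpolation a_t = (1 - t) * a0 + t * a1.
    at = [(1.0 - t) * x0 + t * x1 for x0, x1 in zip(a0, a1)]
    # Flow-matching velocity target for this straight-line path.
    v_target = [x1 - x0 for x0, x1 in zip(a0, a1)]
    return a0, at, v_target


if __name__ == "__main__":
    random.seed(0)
    h, t = sample_h_t()
    a0, at, v = noise_action([0.5, -0.2, 1.0], t)
    print(h, t, at)
```

In a batched implementation these per-sample draws would be vectorized across the minibatch, in line with the "parallelize over batch" comment in the pseudocode.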
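
The task naming described in the Dataset Splits row can be expanded programmatically. A minimal sketch, assuming identifiers of the form `<environment>-<protocol>-singletask-task<k>`: the environment list is a hypothetical subset built from environment names mentioned elsewhere on this page (not the paper's full set of eight), and registered OGBench dataset IDs may additionally carry a version suffix that is omitted here.

```python
# Hypothetical subset of environments, each paired with the dataset protocol
# described in the Dataset Splits row (navigate for locomotion, play for manipulation).
ENVIRONMENTS = [
    ("antmaze-large", "navigate"),
    ("antsoccer-arena", "navigate"),
    ("cube-single", "play"),
]


def singletask_names(env, protocol, num_tasks=5):
    """Return <env>-<protocol>-singletask-task1 ... task<num_tasks> (no version suffix)."""
    return [f"{env}-{protocol}-singletask-task{i}" for i in range(1, num_tasks + 1)]


def all_task_names(environments=ENVIRONMENTS):
    """Flatten the per-environment task lists, five tasks per environment."""
    return [
        name
        for env, protocol in environments
        for name in singletask_names(env, protocol)
    ]


if __name__ == "__main__":
    for name in singletask_names("cube-single", "play"):
        print(name)
```

Using all five tasks per environment means each environment contributes five separate evaluation targets.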
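
The gradient clipping mentioned in the Software Dependencies row (with norm 1, per the Experiment Setup row) can be illustrated with a minimal pure-Python sketch. The excerpt does not say whether gradients are clipped jointly or per parameter; clipping by global norm is assumed here, and `clip_by_global_norm` is a hypothetical helper. In JAX, a comparable setup is commonly written with Optax as `optax.chain(optax.clip_by_global_norm(1.0), optax.adam(1e-4))`, though the excerpt does not state which library the authors use.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their combined L2 norm is at most max_norm.

    grads: list of flattened parameter gradients (each a list of floats).
    Returns (clipped_grads, original_global_norm).
    """
    # Global norm: L2 norm over every gradient entry of every parameter.
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        # Already within the bound: return an unchanged copy.
        return [list(tensor) for tensor in grads], global_norm
    # Shrink every entry by the same factor so the new global norm equals max_norm.
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads], global_norm
```

For example, `clip_by_global_norm([[3.0], [4.0]])` sees a global norm of 5, so both entries are scaled by 1/5, giving approximately `[[0.6], [0.8]]`. Scaling all parameters by one factor preserves the update direction, which per-parameter clipping does not.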
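
The hyperparameters in the Experiment Setup row can be collected into a small configuration object. This is a hypothetical sketch: the names `SORLConfig`, `config_for`, and `eval_steps` are illustrative, only the two Q-loss coefficients quoted in the setup are filled in, matching the γ = 0.995 environments by name prefix is an assumption, and the evaluation schedule assumes evaluations at steps 100,000 through 1,000,000 (whether step 0 is also evaluated is not stated).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SORLConfig:
    q_loss_coef: float  # environment-dependent, no shared default
    discount: float = 0.99
    hidden_dims: Tuple[int, ...] = (512, 512, 512, 512)  # 4 hidden layers of size 512
    batch_size: int = 256
    learning_rate: float = 1e-4
    grad_clip_norm: float = 1.0
    bc_coef: float = 10.0
    self_consistency_coef: float = 10.0
    train_steps: int = 1_000_000
    eval_interval: int = 100_000
    eval_episodes: int = 50


# Environments that use gamma = 0.995 (matched by name prefix: an assumption).
LONG_HORIZON_PREFIXES = ("antmaze-giant", "humanoidmaze", "antsoccer")
# Only the two Q-loss coefficients quoted in the setup are filled in.
Q_LOSS_COEF = {"antmaze-large": 500.0, "cube-single-play": 10.0}


def config_for(env_name: str, q_loss_coef: Optional[float] = None) -> SORLConfig:
    """Build a config, choosing the discount by environment name."""
    discount = 0.995 if env_name.startswith(LONG_HORIZON_PREFIXES) else 0.99
    if q_loss_coef is None:
        # Raises KeyError for environments whose coefficient is not listed above.
        q_loss_coef = Q_LOSS_COEF[env_name]
    return SORLConfig(q_loss_coef=q_loss_coef, discount=discount)


def eval_steps(cfg: SORLConfig):
    """Gradient steps at which evaluation runs (assumed: not at step 0)."""
    return list(range(cfg.eval_interval, cfg.train_steps + 1, cfg.eval_interval))
```

Under that schedule assumption a run performs 10 evaluations of 50 episodes each, or 500 evaluation episodes per run. Requiring an explicit Q-loss coefficient for unlisted environments avoids silently inventing a value the source does not give.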