Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Horizon Reduction Makes RL Scalable

Authors: Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, Sergey Levine

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we study the scalability of offline reinforcement learning (RL) algorithms. ... We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. ... To answer this question, we generate large-scale datasets for tasks that require highly complex, long-horizon reasoning, and study how current offline RL algorithms scale with data. Specifically, on complex simulated robotics tasks across diverse domains in OGBench [75], we collect a dataset with up to one billion transitions for each environment... In these controlled yet challenging environments, we evaluate the performance of state-of-the-art offline RL algorithms while varying the amount of data. We observe that many existing offline RL algorithms struggle to scale, even with orders of magnitude more data in these idealized environments. ... Their performance often saturates far below the maximum possible performance (Figure 1), especially on complex, long-horizon tasks, suggesting that there exist scalability challenges in offline RL.
Researcher Affiliation | Academia | (1) University of California, Berkeley; (2) Princeton University; (3) Carnegie Mellon University
Pseudocode | Yes | We provide the pseudocode for SHARSA and double SHARSA in Algorithms 1 and 2.
Open Source Code | Yes | Code: https://github.com/seohongpark/horizon-reduction
Open Datasets | Yes | To facilitate this, we open-source our tasks, datasets, and implementations (link), where we have made them as easy to use as possible. ... We employ four highly challenging offline goal-conditioned RL tasks in robotics from the OGBench task suite [75].
Dataset Splits | No | The paper mentions generating large-scale datasets (up to 1B transitions) and evaluating performance using '15 rollouts on each of the 5 (4 for cube-double) evaluation goals'. It also mentions training for '5M gradient steps'. However, it does not explicitly specify a conventional training, validation, and test split of the collected datasets in terms of percentages or specific sample counts for the main learning process. The evaluation goals define what is tested, but not how the entire dataset used for training is formally partitioned for generalization analysis.
Hardware Specification | Yes | Each run in this work takes no more than three days on a single A5000 GPU. ... This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at UC Berkeley.
Software Dependencies | No | The paper mentions using specific optimizers like Adam [46] and nonlinearities like GELU [34], and states that methods are implemented 'on top of the reference implementations of OGBench [75]'. However, it does not provide specific version numbers for key software components such as Python, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries, which are necessary for full reproducibility of the software environment.
Experiment Setup | Yes | We train each offline RL algorithm for 5M gradient steps (2.5M steps for simpler tasks in Figure 10) and evaluate every 250K steps. ... The hyperparameters (in particular, the degree of behavioral regularization) of each algorithm are individually tuned on each task based on the largest 1B datasets. We provide the full list of hyperparameters in Tables 3 and 4... Table 3: Common hyperparameters for OGBench experiments.

Hyperparameter | Value
Gradient steps | 5M (default), 2.5M (cube-double, puzzle-4x4)
Optimizer | Adam [46]
Learning rate | 0.0003
Batch size | 1024
MLP size | [1024, 1024, 1024, 1024]
Nonlinearity | GELU [34]
Layer normalization | True
Target network update rate | 0.005
Discount factor γ | 0.999 (default), 0.99 (cube-double, puzzle-4x4)
Flow steps | 10
Horizon reduction factor n | 50 (cube, humanoidmaze), 25 (puzzle)
Expectile κ (IQL) | 0.9
Expectile κ (HIQL) | 0.5 (cube, humanoidmaze), 0.7 (puzzle)
Value representation dimensionality k (CRL) | 1024
Goal representation dimensionality k (HIQL) | 128
Double Q aggregation (SAC+BC, SHARSA, FQL) | min(Q1, Q2) (cube, puzzle), (Q1 + Q2)/2 (humanoidmaze)
Value loss type (SAC+BC, SHARSA, FQL) | Binary cross entropy
Actor (p^D_cur, p^D_geom, p^D_traj, p^D_rand) ratio (BC) | (0, 1, 0, 0)
Actor (p^D_cur, p^D_geom, p^D_traj, p^D_rand) ratio (others) | (0, 1, 0, 0) (cube), (0.5, 0.5, 0, 0) (puzzle), (0, 0, 1, 0) (humanoidmaze)
Value (p^D_cur, p^D_geom, p^D_traj, p^D_rand) ratio (CRL) | (0, 1, 0, 0)
Value (p^D_cur, p^D_geom, p^D_traj, p^D_rand) ratio (others) | (0.2, 0, 0.5, 0.3)
Policy extraction hyperparameters | Table 4
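
For readers who want to reuse the common Table 3 values, the settings quoted above could be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the class name `CommonConfig`, the field names, the helper `config_for`, and the prefix-based matching of task families (cube, puzzle, humanoidmaze) are all illustrative assumptions. Only the numeric values and the task names cube-double and puzzle-4x4 come from the quoted text. Algorithm-specific entries (expectiles, representation dimensionalities, sampling ratios) are omitted for brevity.

```python
from dataclasses import dataclass, replace


# Hypothetical container for the common hyperparameters quoted from Table 3.
# Field names are illustrative; they are not taken from the released code.
@dataclass(frozen=True)
class CommonConfig:
    gradient_steps: int = 5_000_000      # 5M by default
    eval_interval: int = 250_000         # "evaluate every 250K steps"
    optimizer: str = "adam"
    learning_rate: float = 3e-4
    batch_size: int = 1024
    mlp_sizes: tuple = (1024, 1024, 1024, 1024)
    nonlinearity: str = "gelu"
    layer_norm: bool = True
    target_update_rate: float = 0.005
    discount: float = 0.999
    flow_steps: int = 10
    horizon_reduction_n: int = 50        # cube, humanoidmaze
    q_aggregation: str = "min"           # min(Q1, Q2) for cube, puzzle


# The two tasks that Table 3 lists with 2.5M steps and discount 0.99.
SIMPLER_TASKS = ("cube-double", "puzzle-4x4")


def config_for(task: str) -> CommonConfig:
    """Apply the per-task overrides listed in Table 3.

    Assumption: task families are identified by name prefix.
    """
    cfg = CommonConfig()
    if task.startswith("puzzle"):
        cfg = replace(cfg, horizon_reduction_n=25)
    if task.startswith("humanoidmaze"):
        # (Q1 + Q2) / 2 instead of min(Q1, Q2)
        cfg = replace(cfg, q_aggregation="mean")
    if task in SIMPLER_TASKS:
        cfg = replace(cfg, gradient_steps=2_500_000, discount=0.99)
    return cfg


if __name__ == "__main__":
    cfg = config_for("puzzle-4x4")
    print(cfg.horizon_reduction_n, cfg.gradient_steps, cfg.discount)
```

Keeping the overrides in one function makes it easy to compare a run's settings against the table. With the default values, 5M gradient steps at one evaluation every 250K steps corresponds to 20 evaluation points per run.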