Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Compute-Optimal Scaling for Value-Based Deep RL

Authors: Preston Fu, Oleh Rybkin, Zhiyuan (Paul) Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning. Project page: value-scaling.github.io. |
| Researcher Affiliation | Academia | 1 UC Berkeley, 2 University of Warsaw, 3 Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Training loop drop-ins for any value-based algorithm |
| Open Source Code | Yes | Code: github.com/prestonfu/model_scaling. |
| Open Datasets | Yes | For our initial study, we leverage the results from [34] on DeepMind Control suite [50]. Following prior work [14, 34], we separate these into 7 medium difficulty tasks (referred to as DMC-medium) and 6 hard difficulty tasks (DMC-hard). For these tasks, we fit averages of the tasks for the two suites respectively, building upon the protocol prescribed in Rybkin et al. [39], to show generalization of our fits across tasks. We evaluate scaling on 4 more difficult tasks from DMC and HumanoidBench [42], where we make fits for each task individually to show applicability to single tasks. |
| Dataset Splits | Yes | We construct a held-out dataset of transitions following the same distribution as the training replay buffer. To do so, we create a validation environment, which is identical to the training environment with a different random seed, and a corresponding validation replay buffer. This allows us to measure the validation TD-error, i.e. the TD-error of the critic against the target on data sampled from the validation replay buffer. |
| Hardware Specification | Yes | We thank the TRC program at Google Cloud for providing TPU sources that supported this work. We thank NCSA Delta cluster for providing GPU resources that supported the experiments in this work. |
| Software Dependencies | No | The paper mentions using the BRO [34] and SimbaV2 [26] algorithms, but does not specify versions of the programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA) used for their implementation. |
| Experiment Setup | Yes | We use BRO [34] and SimbaV2 [26], approaches based on SAC [13]... Thus, to study the impact of model size, we vary only the network width in {256, 512, 1024, 2048, 4096}. We consider batch sizes from 4 to 4096 (varied in powers of 2) and UTD ratios of 1, 2, 4, 8. We keep other hyperparameters fixed across all tasks at values suggested by Nauman et al. [34]... Due to the computational requirements of running a large grid search for obtaining the scaling fits, we use a constant network depth (2 BroNet blocks [34]) and learning rate (3e-4) throughout our experiments and run at least 5 random seeds per configuration. |
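
To make the size of the sweep in the Experiment Setup row concrete, the following is a minimal Python sketch that enumerates the quoted hyperparameter values. It is an illustration only, not the authors' code: the names `WIDTHS`, `BATCH_SIZES`, `UTD_RATIOS`, `NUM_SEEDS`, and `build_grid` are hypothetical, and it assumes a full Cartesian product of the quoted values, which the paper may not have run in its entirety.

```python
from itertools import product

# Values quoted in the Experiment Setup row; all names here are hypothetical.
WIDTHS = [256, 512, 1024, 2048, 4096]
BATCH_SIZES = [2 ** k for k in range(2, 13)]  # 4, 8, ..., 4096 (powers of 2)
UTD_RATIOS = [1, 2, 4, 8]
NUM_SEEDS = 5          # the quoted text says "at least 5" seeds per configuration
LEARNING_RATE = 3e-4   # held constant per the quoted text
NUM_BLOCKS = 2         # constant network depth per the quoted text


def build_grid():
    """Return one config dict per (width, batch size, UTD) combination.

    Assumption: a full Cartesian product. The actual sweep may skip cells.
    """
    return [
        {
            "width": width,
            "batch_size": batch_size,
            "utd": utd,
            "learning_rate": LEARNING_RATE,
            "num_blocks": NUM_BLOCKS,
        }
        for width, batch_size, utd in product(WIDTHS, BATCH_SIZES, UTD_RATIOS)
    ]


grid = build_grid()
print(len(grid))              # 220 configurations (5 widths x 11 batch sizes x 4 UTD ratios)
print(len(grid) * NUM_SEEDS)  # 1100 runs per task at the minimum seed count
```

Under these assumptions the sweep is at most 220 configurations per task, or 1100 training runs at five seeds each, which is consistent with the quoted remark that depth and learning rate were held fixed for computational reasons.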