Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Compute-Optimal Scaling for Value-Based Deep RL

Authors: Preston Fu, Oleh Rybkin, Zhiyuan (Paul) Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning. Project page: value-scaling.github.io.
Researcher Affiliation	Academia	1UC Berkeley 2University of Warsaw 3Carnegie Mellon University
Pseudocode	Yes	Algorithm 1 Training loop drop-ins for any value-based algorithm
Open Source Code	Yes	Code: github.com/prestonfu/model scaling.
Open Datasets	Yes	For our initial study, we leverage the results from [34] on Deepmind Control suite [50]. Following prior work [14, 34], we separate these into 7 medium difficulty tasks (referred to as DMC-medium) and 6 hard difficulty tasks (DMC-hard). For these tasks, we fit averages of the tasks for the two suites respectively, building upon the protocol prescribed in Rybkin et al. [39], to show generalization of our fits across tasks. We evaluate scaling on 4 more difficult tasks from DMC and Humanoid Bench [42], where we make fits for each task individually to show applicability to single tasks.
Dataset Splits	Yes	We construct a held-out dataset of transitions following the same distribution as the training replay buffer. To do so, we create a validation environment, which is identical to the training environment with a different random seed, and a corresponding validation replay buffer. This allows us to measure the validation TD-error, i.e. the TD-error of the critic against the target on data sampled from the validation replay buffer.
Hardware Specification	Yes	We thank the TRC program at Google Cloud for providing TPU sources that supported this work. We thank NCSA Delta cluster for providing GPU resources that supported the experiments in this work.
Software Dependencies	No	The paper mentions using BRO [34] and Simba V2 [26] algorithms, but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions) used for their implementation.
Experiment Setup	Yes	We use BRO [34] and Simba V2 [26], approaches based on SAC [13]... Thus, to study the impact of model size, we vary only the network width in {256, 512, 1024, 2048, 4096}. We consider batch sizes from 4 to 4096 (varied in powers of 2) and UTD ratios of 1, 2, 4, 8. We keep other hyperparameters fixed across all tasks at values suggested by Nauman et al. [34]... Due to the computational requirements of running a large grid search for obtaining the scaling fits, we use a constant network depth (2 BroNet blocks [34]) and learning rate (3e-4) throughout our experiments and run at least 5 random seeds per configuration.
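
To get a sense of the size of the sweep described in the Experiment Setup row, the sketch below enumerates the quoted hyperparameter values in Python. It assumes a full Cartesian product of width, batch size, and UTD ratio, which the excerpt does not confirm, so the resulting count is best read as an upper bound. The constant names and the `full_grid` helper are hypothetical and are not taken from the authors' code.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
WIDTHS = [256, 512, 1024, 2048, 4096]
BATCH_SIZES = [2 ** k for k in range(2, 13)]  # powers of 2 from 4 to 4096
UTD_RATIOS = [1, 2, 4, 8]
LEARNING_RATE = 3e-4  # held constant
NUM_BLOCKS = 2        # constant network depth
MIN_SEEDS = 5         # "at least 5 random seeds per configuration"


def full_grid():
    """Hypothetical helper: every (width, batch size, UTD) combination.

    Assumes a full cross product; the paper may have run only a subset.
    """
    return [
        {
            "width": width,
            "batch_size": batch_size,
            "utd": utd,
            "learning_rate": LEARNING_RATE,
            "num_blocks": NUM_BLOCKS,
        }
        for width, batch_size, utd in product(WIDTHS, BATCH_SIZES, UTD_RATIOS)
    ]


if __name__ == "__main__":
    grid = full_grid()
    print(f"{len(grid)} configurations per task")
    print(f"{len(grid) * MIN_SEEDS} runs per task at {MIN_SEEDS} seeds each")
```

Under that assumption the grid has 5 × 11 × 4 = 220 configurations per task, or 1100 runs per task at the minimum seed count, which is consistent with the excerpt's remark about the computational cost of the grid search.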