Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Actor-Free Continuous Control via Structurally Maximizable Q-Functions

Authors: Yigit Korkmaz, Urvi Bhuwania, Ayush Jain, Erdem Bıyık

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample-efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor.
Researcher Affiliation | Collaboration | Yigit Korkmaz¹*, Urvi Bhuwania¹*, Ayush Jain¹,², Erdem Bıyık¹ (¹Thomas Lord Department of Computer Science, University of Southern California; ²Meta AI)
Pseudocode | Yes | Algorithm 1 Q3C
Open Source Code | Yes | We have released our code at https://github.com/USC-Lira/Q3C.
Open Datasets | Yes | We evaluate our method on several tasks from the Gymnasium suite [50]. Specifically, we use Pendulum, Swimmer, Hopper, Bipedal Walker, Walker2d, Half Cheetah, and Ant tasks to cover a range of task difficulty. In all experiments, we use state-based observations and do not modify the reward function of the tasks. In addition, similar to prior work [24, 39], we create restricted versions of a subset of the environments, namely Inverted Pendulum, Hopper, and Half Cheetah.
Dataset Splits | No | RL typically involves interacting with an environment to generate data, rather than using a static pre-split dataset. While the paper describes evaluation protocols (e.g., 'evaluate each method every 10000 steps by running 10 rollout episodes and report the average return.'), it does not explicitly define how a fixed dataset is partitioned into training, validation, or test sets in the traditional sense of supervised learning.
Hardware Specification | Yes | Experiments were run with 5 random seeds each: restricted environments on Tesla P100 GPUs and unrestricted environments on A100 GPUs.
Software Dependencies | No | The paper mentions 'stable-baselines3' [37] and 'rlzoo3' [36] as implementations and hyperparameter sources but does not provide specific version numbers for these or other key software components like Python, PyTorch, or CUDA within the main text or appendices.
Experiment Setup | Yes | Q3C adopts its hyperparameters from its underlying implementation of TD3, but we tune certain important hyperparameters such as learning rate and learning starts. Furthermore, we tune the hyperparameters specific to Q3C such as the number of control-points, the number of nearest neighbors k, separation loss weight, and initial smoothing value. Table 9: Q3C Hyperparameters (Part 1 of 2); Table 10: Q3C Hyperparameters (Part 2 of 2); Table 11: Q3C Hyperparameters for Restricted Environments.
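
The evaluation protocol quoted in the Dataset Splits and Hardware Specification rows (evaluate every 10000 steps, 10 rollout episodes, average return, 5 random seeds) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: `ToyEnv`, `evaluate`, `evaluation_steps`, and `run_all_seeds` are hypothetical names, the toy environment merely imitates the Gymnasium `reset`/`step` signatures, and the training loop between evaluations is left out.

```python
import random
from statistics import mean

EVAL_INTERVAL = 10_000  # quoted protocol: evaluate every 10000 steps
EVAL_EPISODES = 10      # quoted protocol: 10 rollout episodes per evaluation
NUM_SEEDS = 5           # quoted protocol: 5 random seeds per experiment


class ToyEnv:
    """Hypothetical stand-in that imitates the Gymnasium reset/step signatures."""

    def __init__(self, horizon=20):
        self.horizon = horizon
        self.t = 0
        self.rng = random.Random()

    def reset(self, seed=None):
        if seed is not None:
            self.rng = random.Random(seed)
        self.t = 0
        return self.rng.uniform(-1.0, 1.0), {}

    def step(self, action):
        self.t += 1
        reward = -abs(action)  # toy reward (assumption): smaller actions score higher
        truncated = self.t >= self.horizon
        return self.rng.uniform(-1.0, 1.0), reward, False, truncated, {}


def evaluate(policy, env, episodes=EVAL_EPISODES, seed=0):
    """Run `episodes` rollouts and return the average undiscounted return."""
    returns = []
    for ep in range(episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return mean(returns)


def evaluation_steps(total_steps, interval=EVAL_INTERVAL):
    """Training-step counts at which an evaluation would be triggered."""
    return list(range(interval, total_steps + 1, interval))


def run_all_seeds(policy, total_steps, num_seeds=NUM_SEEDS):
    """Collect one evaluation curve per seed; the training updates are omitted."""
    curves = {}
    for seed in range(num_seeds):
        env = ToyEnv()
        curves[seed] = [
            (step, evaluate(policy, env, seed=seed * 1000))
            for step in evaluation_steps(total_steps)
        ]
    return curves
```

In an actual run, the policy would be updated between evaluation points and `ToyEnv` would be replaced by one of the Gymnasium tasks listed in the Open Datasets row; the per-seed curves are what would then be aggregated into the reported average return.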