Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Actor-Free Continuous Control via Structurally Maximizable Q-Functions
Authors: Yigit Korkmaz, Urvi Bhuwania, Ayush Jain, Erdem Bıyık
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample-efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. |
| Researcher Affiliation | Collaboration | Yigit Korkmaz¹*, Urvi Bhuwania¹*, Ayush Jain¹,², Erdem Bıyık¹. ¹Thomas Lord Department of Computer Science, University of Southern California; ²Meta AI |
| Pseudocode | Yes | Algorithm 1 Q3C |
| Open Source Code | Yes | We have released our code at https://github.com/USC-Lira/Q3C. |
| Open Datasets | Yes | We evaluate our method on several tasks from the Gymnasium suite [50]. Specifically, we use Pendulum, Swimmer, Hopper, Bipedal Walker, Walker2d, Half Cheetah, and Ant tasks to cover a range of task difficulty. In all experiments, we use state-based observations and do not modify the reward function of the tasks. In addition, similar to prior work [24, 39], we create restricted versions of a subset of the environments, namely Inverted Pendulum, Hopper, and Half Cheetah. |
| Dataset Splits | No | RL typically involves interacting with an environment to generate data, rather than using a static pre-split dataset. While the paper describes evaluation protocols (e.g., 'evaluate each method every 10000 steps by running 10 rollout episodes and report the average return.'), it does not explicitly define how a fixed dataset is partitioned into training, validation, or test sets in the traditional sense of supervised learning. |
| Hardware Specification | Yes | Experiments were run with 5 random seeds each: restricted environments on Tesla P100 GPUs and unrestricted environments on A100 GPUs. |
| Software Dependencies | No | The paper mentions 'stable-baselines3' [37] and 'rlzoo3' [36] as implementations and hyperparameter sources but does not provide specific version numbers for these or other key software components like Python, PyTorch, or CUDA within the main text or appendices. |
| Experiment Setup | Yes | Q3C adopts its hyperparameters from its underlying implementation of TD3, but we tune certain important hyperparameters such as learning rate and learning starts. Furthermore, we tune the hyperparameters specific to Q3C such as the number of control-points, the number of nearest neighbors k, separation loss weight, and initial smoothing value. Table 9: Q3C Hyperparameters (Part 1 of 2); Table 10: Q3C Hyperparameters (Part 2 of 2); Table 11: Q3C Hyperparameters for Restricted Environments |
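
The evaluation protocol quoted in the Dataset Splits row (evaluate every 10000 steps by running 10 rollout episodes and report the average return) can be illustrated with a minimal sketch. This is not the authors' code: `ToyEnv`, `evaluate`, and `evaluation_steps` are hypothetical names introduced here, and `ToyEnv` is a stand-in that only mimics the Gymnasium-style `reset`/`step` return shape so the example runs without extra dependencies.

```python
from statistics import mean


class ToyEnv:
    """Hypothetical stand-in for a Gymnasium task (assumption, not from the paper).

    Episodes last `horizon` steps; the reward is 1 - |action| at every step.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        # Gymnasium-style return: (observation, info)
        self.t = 0
        return 0.0, {}

    def step(self, action):
        # Gymnasium-style return: (obs, reward, terminated, truncated, info)
        self.t += 1
        reward = 1.0 - abs(action)
        truncated = self.t >= self.horizon
        return float(self.t), reward, False, truncated, {}


def evaluate(policy, env, n_episodes=10):
    """Run `n_episodes` rollouts and return the average undiscounted return."""
    returns = []
    for episode in range(n_episodes):
        obs, _ = env.reset(seed=episode)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return mean(returns)


def evaluation_steps(total_steps, eval_freq=10_000):
    """Training-step counts at which an evaluation would be triggered."""
    return list(range(eval_freq, total_steps + 1, eval_freq))


if __name__ == "__main__":
    zero_policy = lambda obs: 0.0  # always outputs action 0
    print(evaluation_steps(30_000))
    print(evaluate(zero_policy, ToyEnv(horizon=5)))
```

Because data comes from fresh rollouts rather than a held-out partition, this kind of periodic evaluation takes the place of a train/validation/test split, which is consistent with the "No" recorded for Dataset Splits.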