Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Automaton Constrained Q-Learning

Authors: Anastasios Manganaris, Vittorio Giammarino, Ahmed Qureshi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that ACQL outperforms existing methods across a range of continuous control tasks, including cases where prior methods fail to satisfy either goal-reaching or safety constraints. We further validate its real-world applicability by deploying ACQL on a 6-DOF robotic arm performing a goal-reaching task in a cluttered, cabinet-like space with safety constraints. Our results demonstrate that ACQL is a robust and scalable solution for learning robotic behaviors according to rich temporal specifications.
Researcher Affiliation | Academia | Anastasios Manganaris, Vittorio Giammarino, and Ahmed H. Qureshi, Department of Computer Science, Purdue University
Pseudocode | Yes | Algorithm 1: Automaton Constrained Q-Learning
Open Source Code | Yes | We provide open access to anonymized code, along with scripts and an exact description of all required dependencies with version information, needed for running all of our experiments. ... All further details regarding environment geometry and task definitions for our simulated and real-world experiments are included in our Code Repository (footnote 1: https://github.com/Tass0sm/acql).
Open Datasets | No | All environment simulation was done within the Brax physics simulator [57] and assets provided in Jax GCRL [62] for the Point Mass, Quadcopter, and Ant environments. ... Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: This research (1) does not involve human subjects or participants, (2) does not involve any datasets...
Dataset Splits | No | All results are reported for the final policy obtained after 5 million environment interactions with five different seeds. For each seed, we evaluated the final policy in 16 randomly initialized episodes lasting 1000 steps.
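The evaluation protocol above (five seeds, 16 evaluation episodes per seed, 1000 steps each) can be sketched as a small aggregation routine. This is a minimal sketch, assuming each seed's episodes are averaged first and the per-seed averages are then summarized as a mean and sample standard deviation; the report does not state which statistic or per-episode metric is used, and the function name `summarize_final_policy` is hypothetical, not taken from the ACQL repository.

```python
from __future__ import annotations

import statistics
from typing import Sequence


def summarize_final_policy(
    returns_by_seed: Sequence[Sequence[float]],
    episodes_per_seed: int = 16,
) -> tuple[float, float]:
    """Hypothetical helper: average each seed's evaluation episodes, then
    return the mean and sample standard deviation across seeds.

    `returns_by_seed[i]` holds one per-episode metric value for each of the
    evaluation episodes run with seed i (whatever metric is being reported).
    """
    per_seed_means = []
    for seed_returns in returns_by_seed:
        # Guard against silently mixing seeds with different episode counts.
        if len(seed_returns) != episodes_per_seed:
            raise ValueError(
                f"expected {episodes_per_seed} episodes, got {len(seed_returns)}"
            )
        per_seed_means.append(statistics.fmean(seed_returns))
    # stdev needs at least two seeds; the protocol above uses five.
    return statistics.fmean(per_seed_means), statistics.stdev(per_seed_means)
```

For example, five seeds whose 16 episodes score 1.0, 2.0, 3.0, 4.0, and 5.0 respectively yield a mean of 3.0 and a standard deviation of about 1.58. Averaging within each seed first treats the seed, rather than the individual episode, as the unit of replication.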
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA RTX 3090 GPU (24 GB VRAM), using a local workstation equipped with a 12th Gen Intel i7-12700F CPU and 32 GB RAM.
Software Dependencies | Yes | Our code uses the following libraries: (1) Brax (Apache License 2.0), (2) MuJoCo Menagerie (MIT), (3) Jax GCRL (Apache 2.0), (4) Spot (GPLv3). Our paper cites these when discussing implementation details in the appendix.
Experiment Setup | Yes | Table 3 reports the hyperparameter values most commonly used in our experiments, including hyperparameters for the safety gamma (γc) scheduler described in Appendix B. For a complete account of hyperparameters, as well as ACQL, baseline, and environment implementation details, refer to our Code Repository.
Table 3: Hyperparameter values used for experiments in Tables 1 and 2 in our main paper
Hyperparameter Name | Value
Episode length (T) | 1000
Discount factor (γ) | 0.99
Learning rate (α) | 1 × 10⁻⁴
ϵ-greedy factor (ϵ) | 0.1
Safety limit (L) | 0.0
Safety Gamma Init Value | 0.80
Safety Gamma Update Period | 250,000
Safety Gamma Decay Rate | 0.15
Safety Gamma Max. Value | 0.98
Target parameter interpolation factor (λ) | 0.005
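The Table 3 values can be collected into a single configuration object. The sketch below is a hypothetical transcription: the field names are illustrative and not the repository's, and the safety gamma update rule is an assumption, since the actual scheduler is described in the paper's Appendix B, which is not reproduced here. It assumes that at every update period the gap 1 − γc shrinks by the decay rate, with the result capped at the maximum value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ACQLHyperparams:
    """Values transcribed from Table 3; field names are hypothetical."""
    episode_length: int = 1000                # T
    discount: float = 0.99                    # gamma
    learning_rate: float = 1e-4               # alpha
    epsilon: float = 0.1                      # epsilon-greedy factor
    safety_limit: float = 0.0                 # L
    safety_gamma_init: float = 0.80
    safety_gamma_update_period: int = 250_000  # environment steps
    safety_gamma_decay_rate: float = 0.15
    safety_gamma_max: float = 0.98
    target_tau: float = 0.005                 # lambda, target interpolation factor


def safety_gamma(step: int, hp: ACQLHyperparams = ACQLHyperparams()) -> float:
    """ASSUMED schedule, not the paper's: after each full update period,
    shrink the gap (1 - gamma_c) by the decay rate, then cap at the max."""
    num_updates = step // hp.safety_gamma_update_period
    gap = (1.0 - hp.safety_gamma_init) * (1.0 - hp.safety_gamma_decay_rate) ** num_updates
    return min(hp.safety_gamma_max, 1.0 - gap)
```

Under this assumed rule, γc moves from 0.80 to 0.83 after the first 250,000 steps and reaches the 0.98 cap after 15 updates (3,750,000 steps), within the 5 million interaction budget stated for the final policies. Keeping the config frozen makes it safe to share as a default argument.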