Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Certifying Stability of Reinforcement Learning Policies using Generalized Lyapunov Functions
Authors: Kehan Long, Jorge Cortés, Nikolay Atanasov
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach successfully certifies the stability of RL policies trained on Gymnasium and DeepMind Control benchmarks. We also extend our method to jointly train neural controllers and stability certificates using a multi-step Lyapunov loss, resulting in larger certified inner approximations of the region of attraction compared to the classical Lyapunov approach. Empirically, our formulation certifies policies on Gymnasium [Towers et al., 2024] and DeepMind Control Suite [Tassa et al., 2018] benchmarks where classical single-step Lyapunov methods fail. |
| Researcher Affiliation | Academia | Kehan Long, Jorge Cortés, Nikolay Atanasov; Contextual Robotics Institute, University of California San Diego; EMAIL |
| Pseudocode | No | The paper describes methods and algorithms using mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | We provide an open-source implementation of our method at https://github.com/ExistentialRobotics/Generalized_Policy_Stability. |
| Open Datasets | Yes | We evaluate our method on two standard RL control benchmarks from Gymnasium [Towers et al., 2024] and the DeepMind Control Suite [Tassa et al., 2018]. |
| Dataset Splits | No | The paper describes generating rollout trajectories by simulating the closed-loop system under π_RL with randomly sampled initial states for training. For evaluation, it states, "we quantitatively evaluate our learned certificates by sampling N_test states from the full state space and checking whether (14) is satisfied." While it specifies how data is generated for training and testing, it does not provide traditional, fixed training/test/validation splits of a pre-existing dataset. |
| Hardware Specification | Yes | All experiments were run on a single workstation with an NVIDIA RTX 4090 GPU, AMD Ryzen 9 7950X CPU, and 64 GB RAM. |
| Software Dependencies | No | The paper mentions using specific implementations like "Raffin et al. [2021]" (Stable-Baselines3) and "Hansen et al. [2022]" (TD-MPC), and verification tools like "α-β-CROWN verifier [Wang et al., 2021]" and the "Adam optimizer". However, it does not specify exact version numbers for these software components or programming languages, which is required for a reproducible description of ancillary software dependencies. |
| Experiment Setup | Yes | We train all networks using the Adam optimizer with an initial learning rate of 5×10⁻⁴, a ReduceLROnPlateau scheduler (factor 0.5, patience 500), and a batch size of 256 for 1000 epochs. Gradients are clipped at 5.0. We set the decay parameter to α = 0.02 and the slack to β = 0.01. As discussed in Remark 5.2, we exclude a small ball around the origin during both training and evaluation. Specifically, we set δ = 0.05 for the inverted pendulum and δ = 0.5 for the cartpole. All systems are discretized using explicit Euler integration with a time step of t = 0.05 s. We use fixed step weights (σ1, σ2, ...) selected via grid search: for the inverted pendulum, (0.4, 1.6) for M=2 and (0.3, 1.5, 1.2) for M=3; for path tracking, (0.4, 1.6) and (1.2, 1.2, 0.6), respectively. |
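
The Experiment Setup row mentions explicit Euler discretization with a 0.05 s time step and the exclusion of a small ball of radius δ around the origin. The sketch below illustrates those two steps only, and it is not taken from the authors' repository. The pendulum model, its physical constants, the PD controller `pd_policy`, and the sampling box are all hypothetical choices made for illustration; only the time step and δ = 0.05 come from the table above.

```python
import math
import random

DT = 0.05      # time step reported in the setup row (seconds)
DELTA = 0.05   # excluded-ball radius reported for the inverted pendulum


def pendulum_dynamics(state, torque, g=9.81, length=1.0, mass=1.0, damping=0.1):
    """Continuous-time damped pendulum, theta = 0 upright (assumed model and constants)."""
    theta, omega = state
    omega_dot = (g / length) * math.sin(theta) + (torque - damping * omega) / (mass * length ** 2)
    return (omega, omega_dot)


def euler_step(state, torque, dt=DT):
    """One explicit Euler step: x_{k+1} = x_k + dt * f(x_k, u_k)."""
    derivs = pendulum_dynamics(state, torque)
    return tuple(x + dt * dx for x, dx in zip(state, derivs))


def rollout(state, policy, steps, dt=DT):
    """Simulate the closed-loop system and return the visited states."""
    trajectory = [tuple(state)]
    for _ in range(steps):
        state = euler_step(state, policy(state), dt)
        trajectory.append(state)
    return trajectory


def sample_outside_ball(n, bounds, delta=DELTA, rng=None):
    """Rejection-sample n states uniformly from a box, skipping states with norm below delta."""
    rng = rng or random.Random(0)
    samples = []
    while len(samples) < n:
        x = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        if math.hypot(*x) >= delta:
            samples.append(x)
    return samples


def pd_policy(state):
    """Hypothetical PD controller standing in for a trained RL policy."""
    theta, omega = state
    return -20.0 * theta - 5.0 * omega


if __name__ == "__main__":
    # Hypothetical sampling box: |theta| <= 0.5 rad, |omega| <= 1 rad/s.
    starts = sample_outside_ball(5, [(-0.5, 0.5), (-1.0, 1.0)])
    for x0 in starts:
        final = rollout(x0, pd_policy, steps=200)[-1]
        print(x0, "->", final)
```

Rejection sampling keeps the remaining states uniformly distributed over the box minus the ball, which is one simple way to realize the exclusion; the paper may implement it differently.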