Learning Control Policies for Stochastic Systems with Reach-Avoid Guarantees
Authors: Đorđe Žikelić, Mathias Lechner, Thomas A. Henzinger, Krishnendu Chatterjee
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on 3 stochastic non-linear reinforcement learning tasks. (Evidence also includes the entire Experiments section with its result tables.) |
| Researcher Affiliation | Academia | ¹Institute of Science and Technology Austria (ISTA), ²Massachusetts Institute of Technology (MIT); djordje.zikelic@ist.ac.at, mlechner@mit.edu, tah@ist.ac.at, krishnendu.chatterjee@ist.ac.at |
| Pseudocode | Yes | Algorithm 1: Algorithm for learning reach-avoid policies |
| Open Source Code | Yes | Code is available at https://github.com/mlech26l/neural_martingales |
| Open Datasets | Yes | Our first two environments are a linear 2D system with nonlinear control bounds and the stochastic inverted pendulum control problem. The inverted pendulum environment is taken from the OpenAI Gym (Brockman et al. 2016)... Our third environment concerns a collision avoidance task. |
| Dataset Splits | No | The paper conducts experiments in reinforcement learning environments, where data is generated dynamically, and does not specify traditional train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like PPO and OpenAI Gym but does not provide specific version numbers for any libraries, frameworks, or solvers used in the implementation. |
| Experiment Setup | Yes | The policy and RASM networks consist of two hidden layers (128 units each, ReLU). The RASM network has a single output unit with a softplus activation. We run our algorithm with a timeout of 3 hours. For all tasks, we pre-train the policy networks using 100 iterations of PPO. Algorithm 1's parameters are also listed: mesh τ > 0, number of samples N ∈ ℕ, regularization constant λ > 0. |
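
The Experiment Setup row fully specifies the network shapes, so a short sketch can make them concrete. The following is a minimal sketch assuming PyTorch (the paper does not name its framework here); the class names and the input/output dimensions are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Two hidden layers of 128 ReLU units, as stated in the paper."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class RASMNetwork(nn.Module):
    """RASM candidate: same hidden layers, single softplus output unit."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Softplus(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```

The softplus output keeps the certificate value non-negative everywhere, which is a natural reading of why the paper gives the RASM network a single softplus output unit.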
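
Similarly, the 100-iteration PPO pre-training step could be approximated with an off-the-shelf implementation. The paper does not name a PPO library, so the sketch below, using stable-baselines3 and the Gym pendulum task mentioned in the Open Datasets row, is an assumption; the mapping of "100 iterations" to timesteps is likewise approximate.

```python
import gym
from stable_baselines3 import PPO

# Stand-in environment; the paper uses a stochastic inverted pendulum
# taken from OpenAI Gym, not necessarily this exact task ID.
env = gym.make("Pendulum-v1")

# Match the stated architecture: two hidden layers of 128 units.
model = PPO("MlpPolicy", env, policy_kwargs=dict(net_arch=[128, 128]))

# Roughly 100 PPO iterations at stable-baselines3's default rollout
# length of 2048 steps per iteration.
model.learn(total_timesteps=100 * 2048)
```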