Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Safety-Prioritizing Curricula for Constrained Reinforcement Learning
Authors: Cevahir Koprulu, Thiago Simão, Nils Jansen, Ufuk Topcu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that compared to the state-of-the-art curriculum learning approaches and their naively modified safe versions, SCG achieves optimal performance and the lowest amount of constraint violations during training. (Section 6, Empirical Analysis) Our experiments in constrained RL domains investigate the following questions: |
| Researcher Affiliation | Academia | Cevahir Koprulu1 Thiago D. Simão2 Nils Jansen3 Ufuk Topcu1 1The University of Texas at Austin 2Eindhoven University of Technology 3Ruhr-University Bochum Correspondence to: Cevahir Koprulu (EMAIL). |
| Pseudocode | Yes | Algorithm 1 Safe Curriculum Generation (SCG) Input: Target and initial context distributions φ and ϱ0 Parameters: Safety threshold D, cost threshold D, performance threshold ζ, Wasserstein distance bound ϵ, number of curriculum iterations K, number of rollouts per iteration M, buffer size N Output: Final policy πK |
| Open Source Code | Yes | We put the core code of SCG in the supplementary details. The code includes instructions to install necessary software, reproduce the experiments and target contexts where we evaluate trained policies to obtain the experimental results. |
| Open Datasets | Yes | We consider 3 constrained RL domains: safety-maze, safety-goal, and safety-push (Figures 2 and 5a). In all domains, the agent aims to avoid hazards and reach a goal in the presence of misalignment phenomena. We study safety-maze to showcase that a simple modification to an existing domain (see Section 3.1) can trigger misaligned objectives. In comparison, safety-goal and safety-push are navigation tasks with realistic sensory observations in Safety-Gymnasium (Ji et al., 2023), a framework extensively used for constrained RL. |
| Dataset Splits | Yes | Given a CCMDP M and a target context distribution φ, i.e., an element of the probability simplex ∆(X), contextual constrained RL aims to maximize expected return subject to a cost constraint. The target context distribution is uniform over the top white horizontal area. The target context distributions are uniform distributions over goals placed in the free space on the top row inside the walls/pillars. |
| Hardware Specification | Yes | We run our experiments on a cluster with NVIDIA RTX A5000 GPUs and an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz. |
| Software Dependencies | No | We utilize Omnisafe (Ji et al., 2024) as our RL framework, which uses 16 torch threads and no parallel environments in our experiments. |
| Experiment Setup | Yes | For all the hyperparameters and detailed settings of the experiments, please refer to Appendix D. Table 2: Parameters used for SCG, CURROT, NAIVESAFECURROT, and CURROT4COST. The parameters of the PPO-Lagrangian are fixed to their default values in Omnisafe, except the number of steps to update the policy is 4000 and the number of iterations to update the policy is 12. |
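The contextual constrained RL objective quoted in the Dataset Splits row can be written out explicitly. The sketch below uses standard constrained-RL notation; the symbols $J_r$, $J_c$, $r_t$, $c_t$, $\gamma$, and the cost threshold $d$ are assumptions for illustration, not a verbatim reproduction of the paper's equation.

```latex
% Contextual constrained RL objective (standard notation; symbols assumed,
% not copied from the paper). Contexts x are drawn from the target
% distribution \varphi \in \Delta(\mathcal{X}); the policy maximizes
% expected return while keeping expected cumulative cost below d.
\max_{\pi} \; \mathbb{E}_{x \sim \varphi}\!\left[ J_r(\pi; x) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \varphi}\!\left[ J_c(\pi; x) \right] \le d,
```
```latex
% where return and cost are discounted sums over trajectories \tau
% generated by running \pi in context x:
J_r(\pi; x) = \mathbb{E}_{\tau \sim \pi, x}\!\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t} r_t \Big],
\qquad
J_c(\pi; x) = \mathbb{E}_{\tau \sim \pi, x}\!\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t} c_t \Big].
```

Under this reading, the curriculum in Algorithm 1 interpolates from the initial context distribution ϱ0 toward the target φ, with each step bounded in Wasserstein distance by ϵ.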