Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Safety-Prioritizing Curricula for Constrained Reinforcement Learning

Authors: Cevahir Koprulu, Thiago Simão, Nils Jansen, Ufuk Topcu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that compared to the state-of-the-art curriculum learning approaches and their naively modified safe versions, SCG achieves optimal performance and the lowest amount of constraint violations during training. (Section 6, Empirical Analysis) Our experiments in constrained RL domains investigate the following questions:
Researcher Affiliation | Academia | Cevahir Koprulu (1), Thiago D. Simão (2), Nils Jansen (3), Ufuk Topcu (1); (1) The University of Texas at Austin, (2) Eindhoven University of Technology, (3) Ruhr-University Bochum. Correspondence to: Cevahir Koprulu (EMAIL).
Pseudocode | Yes | Algorithm 1: Safe Curriculum Generation (SCG). Input: target and initial context distributions φ and ϱ_0. Parameters: safety threshold D, cost threshold D, performance threshold ζ, Wasserstein distance bound ϵ, number of curriculum iterations K, number of rollouts per iteration M, buffer size N. Output: final policy π_K.
Open Source Code | Yes | We provide the core code of SCG in the supplementary material. The code includes instructions to install the necessary software and reproduce the experiments, along with the target contexts on which we evaluate trained policies to obtain the experimental results.
Open Datasets | Yes | We consider 3 constrained RL domains: safety-maze, safety-goal, and safety-push (Figures 2 and 5a). In all domains, the agent aims to avoid hazards and reach a goal in the presence of misalignment phenomena. We study safety-maze to showcase that a simple modification to an existing domain (see Section 3.1) can trigger misaligned objectives. In comparison, safety-goal and safety-push are navigation tasks with realistic sensory observations in Safety-Gymnasium (Ji et al., 2023), a framework extensively used for constrained RL.
Dataset Splits | Yes | Given a CCMDP M and a target context distribution φ, i.e., an element of the probability simplex Δ(X), contextual constrained RL aims to maximize expected return subject to a cost constraint. The target context distribution is uniform over the top white horizontal area. The target context distributions are uniform distributions over goals placed in the free space on the top row inside the walls/pillars.
Hardware Specification | Yes | We run our experiments on a cluster with NVIDIA RTX A5000 GPUs and an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz.
Software Dependencies | No | We utilize OmniSafe (Ji et al., 2024) as our RL framework, which uses 16 torch threads and no parallel environments in our experiments.
Experiment Setup | Yes | For all the hyperparameters and detailed settings of the experiments, please refer to Appendix D. Table 2: Parameters used for SCG, CURROT, NAIVESAFECURROT, and CURROT4COST. The parameters of PPO-Lagrangian are fixed to their default values in OmniSafe, except that the number of steps to update the policy is 4000 and the number of iterations to update the policy is 12.
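The contextual constrained RL objective referenced in the Dataset Splits evidence can be written in generic form as follows. This is our notation, not necessarily the paper's exact formulation: $J_r$ and $J_c$ denote expected return and expected cumulative cost of policy $\pi$ in context $x$, and $d$ is the cost threshold:

```latex
\max_{\pi} \;\; \mathbb{E}_{x \sim \varphi}\!\left[ J_r(\pi, x) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \varphi}\!\left[ J_c(\pi, x) \right] \le d
```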
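The Pseudocode row above lists SCG's interface: target and initial context distributions, safety and performance thresholds, a Wasserstein distance bound on curriculum updates, and iteration/rollout counts. As a rough illustration of how such a loop fits together — not the authors' algorithm — here is a minimal sketch that assumes 1-D Gaussian context distributions and a hypothetical `train_policy` callback; the names `d_safety`, `zeta`, and the mean-shift update are simplifying assumptions for illustration only:

```python
import numpy as np

def safe_curriculum_generation(train_policy, target_mean, init_mean,
                               d_safety=0.1, zeta=0.5, eps=0.5,
                               K=50, M=20, std=0.2, seed=0):
    """Illustrative safety-aware curriculum loop (NOT the paper's code).

    Contexts are 1-D; the curriculum is a Gaussian whose mean drifts
    toward the target mean only on contexts solved both well and safely.
    """
    rng = np.random.default_rng(seed)
    mean, policy = init_mean, None
    for _ in range(K):
        # Sample M contexts from the current curriculum distribution.
        contexts = rng.normal(mean, std, size=M)
        # Hypothetical callback: trains the policy on these contexts and
        # returns per-context returns and accumulated costs.
        policy, returns, costs = train_policy(policy, contexts)
        # Keep only contexts the agent solves safely: return at least the
        # performance threshold zeta, cost at most the safety threshold.
        safe = [c for c, r, d in zip(contexts, returns, costs)
                if r >= zeta and d <= d_safety]
        if safe:
            # For equal-variance Gaussians the 2-Wasserstein distance equals
            # the absolute mean difference, so clipping the step at eps bounds
            # the per-iteration Wasserstein movement of the curriculum.
            mean += float(np.clip(target_mean - mean, -eps, eps))
    return policy, mean
```

The Wasserstein cap is the load-bearing design choice: the curriculum can only move toward the target as fast as the bound ϵ allows, and only when the agent demonstrates safe, successful behavior on the current contexts.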