Flipping-based Policy for Chance-Constrained Markov Decision Processes

Authors: Xun Shen, Shuo Jiang, Akifumi Wachi, Kazumune Hashimoto, Sebastien Gros

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | From Section 5 (Experiments): "We conduct a numerical example to illustrate how the flipping-based policy outperforms the deterministic policy in CCMDPs. ... We conduct experiments on Safety Gym [34], where an agent must maximize the expected cumulative reward under a safety constraint with additive structures."
Researcher Affiliation | Collaboration | Xun Shen (Osaka University, shenxun@eei.eng.osaka-u.ac.jp); Shuo Jiang (Osaka University, u316354h@ecs.osaka-u.ac.jp); Akifumi Wachi (LY Corporation, akifumi.wachi@lycorp.co.jp); Kazumune Hashimoto (Osaka University, hashimoto@eei.eng.osaka-u.ac.jp); Sebastien Gros (Norwegian University of Science and Technology, sebastien.gros@ntnu.no)
Pseudocode | Yes | "Algorithm 1: General training algorithm for flipping-based policy" (an illustrative sketch of flipping-based action selection is given after this table)
Open Source Code | Yes | "We have provided the code in the supplemental material."
Open Datasets | Yes | "We conduct experiments on Safety Gym [34], where an agent must maximize the expected cumulative reward under a safety constraint with additive structures." Cited reference [34]: Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning. OpenAI. (A hedged Safety Gym interaction sketch appears after this table.)
Dataset Splits | No | The paper describes a training and testing process involving randomly generated initial states and goal points for testing, but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | Yes | "We used a machine with Intel(R) Core(TM) i7-14700 CPU, 32GB RAM, and NVIDIA 4060 GPU."
Software Dependencies | No | "The infrastructural framework for performing safe RL algorithms is OmniSafe [24]." No specific version numbers for software dependencies are provided. (A hedged OmniSafe quick-start sketch appears after this table.)
Experiment Setup | Yes | Table 1 (Hyper-parameters for Safety Gym experiments), common parameters: network architecture [64, 64]; activation function tanh; learning rate (critic) 2e-4; learning rate (policy) 3e-3; learning rate (penalty) 0.0; discount factor (reward) 0.99; discount factor (safety) 0.995; steps per epoch 40,000; number of conjugate gradient iterations 20; number of iterations to update the policy 10; number of epochs 500; target KL 0.01; batch size for each iteration 1024; damping coefficient 0.1; critic norm coefficient 0.001; std upper and lower bound [0.425, 0.125]; linear learning rate decay true. (These values are collected into a configuration dictionary in the sketch after this table.)
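
Illustrative sketch of a flipping-based policy. The report only names Algorithm 1 without reproducing it. As intuition for the policy class in the paper's title, the sketch below flips a state-dependent weighted coin between two deterministic policies; the class and attribute names (FlippingPolicy, policy_a, policy_b, flip_logit) are hypothetical and are not taken from the authors' code. The [64, 64] tanh network shape only mirrors the common parameters in Table 1.

```python
import torch


class FlippingPolicy(torch.nn.Module):
    """Hypothetical sketch: mix two deterministic policies by flipping a
    state-dependent weighted coin (names and structure are illustrative)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()

        def mlp() -> torch.nn.Sequential:
            # [64, 64] tanh network, matching the common parameters in Table 1.
            return torch.nn.Sequential(
                torch.nn.Linear(obs_dim, hidden), torch.nn.Tanh(),
                torch.nn.Linear(hidden, hidden), torch.nn.Tanh(),
                torch.nn.Linear(hidden, act_dim),
            )

        self.policy_a = mlp()  # first deterministic policy
        self.policy_b = mlp()  # second deterministic policy
        self.flip_logit = torch.nn.Sequential(  # logit of P(choose policy_a | s)
            torch.nn.Linear(obs_dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, 1),
        )

    def act(self, obs: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(self.flip_logit(obs))  # flipping probability per state
        coin = torch.bernoulli(p)                # sample the coin flip
        return coin * self.policy_a(obs) + (1 - coin) * self.policy_b(obs)


if __name__ == "__main__":
    policy = FlippingPolicy(obs_dim=8, act_dim=2)
    print(policy.act(torch.randn(4, 8)).shape)  # torch.Size([4, 2])
```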
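
Hedged Safety Gym interaction sketch. The Open Datasets row cites the OpenAI Safety Gym benchmark [34]. As a reference point for how that benchmark exposes the safety signal, the loop below assumes the original safety_gym package and the classic Gym API, where the per-step cost is returned via info['cost']; the environment id is one of the standard Safexp tasks and may not match the exact tasks used in the paper.

```python
import gym
import safety_gym  # noqa: F401 -- importing registers the Safexp-* environments

# Standard Safety Gym task from the benchmark [34]; the paper's tasks may differ.
env = gym.make('Safexp-PointGoal1-v0')

obs = env.reset()
ep_reward, ep_cost = 0.0, 0.0
for _ in range(1000):
    action = env.action_space.sample()      # placeholder for a learned policy
    obs, reward, done, info = env.step(action)
    ep_reward += reward
    ep_cost += info.get('cost', 0.0)        # per-step safety cost signal
    if done:
        break

print(f'return={ep_reward:.2f}  cumulative cost={ep_cost:.2f}')
```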
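
Hedged configuration sketch for the Experiment Setup row. The dictionary below simply collects the Table 1 common hyper-parameters under illustrative key names; these keys do not follow OmniSafe's actual configuration schema. The omnisafe.Agent call mirrors OmniSafe's quick-start usage, and the 'PPOLag' algorithm and 'SafetyPointGoal1-v0' task are stand-ins that should be checked against the installed OmniSafe version and the paper's experiments.

```python
# Common hyper-parameters from Table 1, gathered into a plain dictionary.
# Key names are illustrative and do NOT follow OmniSafe's config schema.
TABLE1_COMMON = {
    "network_architecture": [64, 64],
    "activation": "tanh",
    "lr_critic": 2e-4,
    "lr_policy": 3e-3,
    "lr_penalty": 0.0,
    "discount_reward": 0.99,
    "discount_safety": 0.995,
    "steps_per_epoch": 40_000,
    "conjugate_gradient_iters": 20,
    "policy_update_iters": 10,
    "epochs": 500,
    "target_kl": 0.01,
    "batch_size": 1024,
    "damping_coeff": 0.1,
    "critic_norm_coeff": 0.001,
    "std_bounds": (0.425, 0.125),  # (upper, lower)
    "linear_lr_decay": True,
}


def train_with_omnisafe() -> None:
    """Quick-start style training call; algorithm and env id are assumptions."""
    import omnisafe

    # Stand-ins for whichever safe RL baseline and Safety Gym task are used.
    agent = omnisafe.Agent("PPOLag", "SafetyPointGoal1-v0")
    agent.learn()


if __name__ == "__main__":
    for name, value in TABLE1_COMMON.items():
        print(f"{name:26s} {value}")
```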