Flipping-based Policy for Chance-Constrained Markov Decision Processes
Authors: Xun Shen, Shuo Jiang, Akifumi Wachi, Kazumune Hashimoto, Sebastien Gros
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): "We conduct a numerical example to illustrate how the flipping-based policy outperforms the deterministic policy in CCMDPs. ... We conduct experiments on Safety Gym [34], where an agent must maximize the expected cumulative reward under a safety constraint with additive structures." |
| Researcher Affiliation | Collaboration | Xun Shen (Osaka University, shenxun@eei.eng.osaka-u.ac.jp); Shuo Jiang (Osaka University, u316354h@ecs.osaka-u.ac.jp); Akifumi Wachi (LY Corporation, akifumi.wachi@lycorp.co.jp); Kazumune Hashimoto (Osaka University, hashimoto@eei.eng.osaka-u.ac.jp); Sebastien Gros (Norwegian University of Science and Technology, sebastien.gros@ntnu.no) |
| Pseudocode | Yes | Algorithm 1: General training algorithm for flipping-based policy (a hedged sketch of the flipping step appears after this table) |
| Open Source Code | Yes | We have provided the code in the supplemental material. |
| Open Datasets | Yes | We conduct experiments on Safety Gym [34], where an agent must maximize the expected cumulative reward under a safety constraint with additive structures. ... Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning. OpenAI. |
| Dataset Splits | No | The paper describes a training and testing process involving randomly generated initial states and goal points for testing, but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing. |
| Hardware Specification | Yes | We used a machine with Intel(R) Core(TM) i7-14700 CPU, 32GB RAM, and NVIDIA 4060 GPU. |
| Software Dependencies | No | The infrastructural framework for performing safe RL algorithms is OmniSafe [24]. No version numbers for the software dependencies are provided. |
| Experiment Setup | Yes | Table 1: Hyper-parameters for Safety Gym experiments. ... Common parameters: network architecture [64, 64]; activation function tanh; learning rate (critic) 2×10⁻⁴; learning rate (policy) 3×10⁻³; learning rate (penalty) 0.0; discount factor (reward) 0.99; discount factor (safety) 0.995; steps per epoch 40,000; number of conjugate gradient iterations 20; number of iterations to update the policy 10; number of epochs 500; target KL 0.01; batch size for each iteration 1024; damping coefficient 0.1; critic norm coefficient 0.001; std upper and lower bounds [0.425, 0.125]; linear learning rate decay: true. (See the config sketch after this table.) |
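
The paper's central object is a flipping-based policy: at each state a (possibly state-dependent) coin flip selects which of two candidate policies to execute, reflecting the paper's claim that such randomization can outperform a single deterministic policy in CCMDPs. The sketch below illustrates only that action-selection step, not the full training loop of Algorithm 1; `policy_a`, `policy_b`, and `flip_prob` are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def flipping_policy_action(state, policy_a, policy_b, flip_prob, rng=None):
    """Sample an action from a flipping-based policy.

    A coin with state-dependent probability `flip_prob(state)` decides
    which of the two candidate policies acts at this state.
    All names here are illustrative placeholders, not the paper's API.
    """
    rng = rng if rng is not None else np.random.default_rng()
    p = float(np.clip(flip_prob(state), 0.0, 1.0))
    # With probability p act according to policy_a, otherwise policy_b.
    if rng.random() < p:
        return policy_a(state)
    return policy_b(state)


# Toy usage on a 2-D continuous action space (illustrative only).
policy_a = lambda s: np.tanh(s)          # e.g. a return-seeking policy
policy_b = lambda s: np.zeros_like(s)    # e.g. a conservative policy
flip_prob = lambda s: 0.7                # constant flipping probability
action = flipping_policy_action(np.array([0.1, -0.4]), policy_a, policy_b, flip_prob)
```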
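For reproduction, the common hyper-parameters of Table 1 can be collected into a single configuration object. The dict below is a minimal sketch in plain Python; the key names are illustrative and do not follow OmniSafe's actual configuration schema, and the negative exponents on the learning rates are reconstructed from the flattened table.

```python
# Common hyper-parameters from Table 1, gathered into a plain dict.
# Key names are illustrative; they do not mirror OmniSafe's config schema.
common_hyperparams = {
    "network_architecture": [64, 64],
    "activation_function": "tanh",
    "learning_rate_critic": 2e-4,   # exponent sign reconstructed
    "learning_rate_policy": 3e-3,   # exponent sign reconstructed
    "learning_rate_penalty": 0.0,
    "discount_factor_reward": 0.99,
    "discount_factor_safety": 0.995,
    "steps_per_epoch": 40_000,
    "num_conjugate_gradient_iterations": 20,
    "num_policy_update_iterations": 10,
    "num_epochs": 500,
    "target_kl": 0.01,
    "batch_size_per_iteration": 1024,
    "damping_coefficient": 0.1,
    "critic_norm_coefficient": 0.001,
    "std_bounds": (0.425, 0.125),   # (upper, lower) bound on policy std
    "linear_learning_rate_decay": True,
}
```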