Flipping-based Policy for Chance-Constrained Markov Decision Processes

Authors: Xun Shen, Shuo Jiang, Akifumi Wachi, Kazumune Hashimoto, Sebastien Gros

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | From Section 5 (Experiments): "We conduct a numerical example to illustrate how the flipping-based policy outperforms the deterministic policy in CCMDPs. ... We conduct experiments on Safety Gym [34], where an agent must maximize the expected cumulative reward under a safety constraint with additive structures."
Researcher Affiliation | Collaboration | Xun Shen (Osaka University, shenxun@eei.eng.osaka-u.ac.jp); Shuo Jiang (Osaka University, u316354h@ecs.osaka-u.ac.jp); Akifumi Wachi (LY Corporation, akifumi.wachi@lycorp.co.jp); Kazumune Hashimoto (Osaka University, hashimoto@eei.eng.osaka-u.ac.jp); Sebastien Gros (Norwegian University of Science and Technology, sebastien.gros@ntnu.no)
Pseudocode | Yes | "Algorithm 1: General training algorithm for flipping-based policy" (an illustrative sketch of flipping-based action selection is given after this table)
Open Source Code | Yes | "We have provided the code in the supplemental material."
Open Datasets | Yes | "We conduct experiments on Safety Gym [34], where an agent must maximize the expected cumulative reward under a safety constraint with additive structures." Cited reference [34]: Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning. OpenAI. (A hedged Safety Gym interaction sketch appears after this table.)
Dataset Splits | No | The paper describes a training and testing process involving randomly generated initial states and goal points for testing, but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | Yes | "We used a machine with Intel(R) Core(TM) i7-14700 CPU, 32GB RAM, and NVIDIA 4060 GPU."
Software Dependencies | No | "The infrastructural framework for performing safe RL algorithms is OmniSafe [24]." No specific version numbers for software dependencies are provided. (A hedged OmniSafe quick-start sketch appears after this table.)
Experiment Setup | Yes | Table 1 (Hyper-parameters for Safety Gym experiments), common parameters: network architecture [64, 64]; activation function tanh; learning rate (critic) 2e-4; learning rate (policy) 3e-3; learning rate (penalty) 0.0; discount factor (reward) 0.99; discount factor (safety) 0.995; steps per epoch 40,000; number of conjugate gradient iterations 20; number of iterations to update the policy 10; number of epochs 500; target KL 0.01; batch size for each iteration 1024; damping coefficient 0.1; critic norm coefficient 0.001; std upper and lower bound [0.425, 0.125]; linear learning rate decay true. (These values are collected into a configuration dictionary in the sketch after this table.)
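
Illustrative sketch of a flipping-based policy. The report only names Algorithm 1 without reproducing it. As intuition for the policy class in the paper's title, the sketch below flips a state-dependent weighted coin between two deterministic policies; the class and attribute names (FlippingPolicy, policy_a, policy_b, flip_logit) are hypothetical and are not taken from the authors' code. The [64, 64] tanh network shape only mirrors the common parameters in Table 1.

```python
import torch


class FlippingPolicy(torch.nn.Module):
    """Hypothetical sketch: mix two deterministic policies by flipping a
    state-dependent weighted coin (names and structure are illustrative)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()

        def mlp() -> torch.nn.Sequential:
            # [64, 64] tanh network, matching the common parameters in Table 1.
            return torch.nn.Sequential(
                torch.nn.Linear(obs_dim, hidden), torch.nn.Tanh(),
                torch.nn.Linear(hidden, hidden), torch.nn.Tanh(),
                torch.nn.Linear(hidden, act_dim),
            )

        self.policy_a = mlp()  # first deterministic policy
        self.policy_b = mlp()  # second deterministic policy
        self.flip_logit = torch.nn.Sequential(  # logit of P(choose policy_a | s)
            torch.nn.Linear(obs_dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, 1),
        )

    def act(self, obs: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(self.flip_logit(obs))  # flipping probability per state
        coin = torch.bernoulli(p)                # sample the coin flip
        return coin * self.policy_a(obs) + (1 - coin) * self.policy_b(obs)


if __name__ == "__main__":
    policy = FlippingPolicy(obs_dim=8, act_dim=2)
    print(policy.act(torch.randn(4, 8)).shape)  # torch.Size([4, 2])
```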
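
Hedged Safety Gym interaction sketch. The Open Datasets row cites the OpenAI Safety Gym benchmark [34]. As a reference point for how that benchmark exposes the safety signal, the loop below assumes the original safety_gym package and the classic Gym API, where the per-step cost is returned via info['cost']; the environment id is one of the standard Safexp tasks and may not match the exact tasks used in the paper.

```python
import gym
import safety_gym  # noqa: F401 -- importing registers the Safexp-* environments

# Standard Safety Gym task from the benchmark [34]; the paper's tasks may differ.
env = gym.make('Safexp-PointGoal1-v0')

obs = env.reset()
ep_reward, ep_cost = 0.0, 0.0
for _ in range(1000):
    action = env.action_space.sample()      # placeholder for a learned policy
    obs, reward, done, info = env.step(action)
    ep_reward += reward
    ep_cost += info.get('cost', 0.0)        # per-step safety cost signal
    if done:
        break

print(f'return={ep_reward:.2f}  cumulative cost={ep_cost:.2f}')
```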
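
Hedged configuration sketch for the Experiment Setup row. The dictionary below simply collects the Table 1 common hyper-parameters under illustrative key names; these keys do not follow OmniSafe's actual configuration schema. The omnisafe.Agent call mirrors OmniSafe's quick-start usage, and the 'PPOLag' algorithm and 'SafetyPointGoal1-v0' task are stand-ins that should be checked against the installed OmniSafe version and the paper's experiments.

```python
# Common hyper-parameters from Table 1, gathered into a plain dictionary.
# Key names are illustrative and do NOT follow OmniSafe's config schema.
TABLE1_COMMON = {
    "network_architecture": [64, 64],
    "activation": "tanh",
    "lr_critic": 2e-4,
    "lr_policy": 3e-3,
    "lr_penalty": 0.0,
    "discount_reward": 0.99,
    "discount_safety": 0.995,
    "steps_per_epoch": 40_000,
    "conjugate_gradient_iters": 20,
    "policy_update_iters": 10,
    "epochs": 500,
    "target_kl": 0.01,
    "batch_size": 1024,
    "damping_coeff": 0.1,
    "critic_norm_coeff": 0.001,
    "std_bounds": (0.425, 0.125),  # (upper, lower)
    "linear_lr_decay": True,
}


def train_with_omnisafe() -> None:
    """Quick-start style training call; algorithm and env id are assumptions."""
    import omnisafe

    # Stand-ins for whichever safe RL baseline and Safety Gym task are used.
    agent = omnisafe.Agent("PPOLag", "SafetyPointGoal1-v0")
    agent.learn()


if __name__ == "__main__":
    for name, value in TABLE1_COMMON.items():
        print(f"{name:26s} {value}")
```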