e-COP : Episodic Constrained Optimization of Policies
Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Sahil Singla
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. |
| Researcher Affiliation | Collaboration | Akhil Agnihotri University of Southern California agnihotri.akhil@gmail.com Rahul Jain Google DeepMind and USC rahulajain@google.com Deepak Ramachandran Google DeepMind ramachandrand@google.com Sahil Singla Google DeepMind sasingla@google.com |
| Pseudocode | Yes | Algorithm 1 Iterative Policy Optimization for Constrained Episodic (IPOCE) RL |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have provided the code. |
| Open Datasets | Yes | For a comprehensive empirical evaluation, we selected eight scenarios from well-known safe RL benchmark environments Safe MuJoCo [43] and Safety Gym [30], as well as MuJoCo environments. |
| Dataset Splits | No | The paper does not explicitly specify traditional train/validation/test dataset splits, as data in RL is generated dynamically. It specifies episode count and horizon for training and evaluation. |
| Hardware Specification | Yes | All experiments were implemented in PyTorch 1.7.0 with CUDA 11.0 and conducted on an Ubuntu 20.04.2 LTS machine with 8 CPU cores (AMD Ryzen Threadripper PRO 3975WX 8-Cores), 127 GB memory and 2 GPU cards (NVIDIA GeForce RTX 4060 Ti). |
| Software Dependencies | Yes | All experiments were implemented in PyTorch 1.7.0 with CUDA 11.0 |
| Experiment Setup | Yes | For the Circle task, we use a point-mass with S ⊆ R^9, A ⊆ R^2, and for the Reach task, an ant robot with S ⊆ R^16, A ⊆ R^8. The Grid task has S ⊆ R^56, A ⊆ R^4. We use two-hidden-layer neural networks to represent Gaussian policies for the tasks. For Circle and Reach, the size is (32,32) for both layers, and for Grid and Navigation the layer sizes are (16,16) and (25,25). We set the step size δ to 10^-4, and for each task, we conduct 5 independent runs of K = 500 episodes, each of horizon H = 200. (A minimal sketch of this setup is given below.) |
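
The Experiment Setup row above describes two-hidden-layer Gaussian policy networks trained in PyTorch for K = 500 episodes of horizon H = 200 with a step size of 10^-4. The following is a minimal sketch of such a setup under those reported values; it is not the authors' code, and the class name, the use of a state-independent log standard deviation, and the use of the Adam optimizer are assumptions for illustration.

```python
# Minimal sketch (not the authors' implementation) of the policy setup quoted
# in the Experiment Setup row. Layer sizes, state/action dimensions, step size
# (1e-4), K = 500, and H = 200 come from the paper excerpt; everything else
# (class names, state-independent log-std, Adam optimizer) is assumed.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Two-hidden-layer Gaussian policy, as described for the benchmark tasks."""

    def __init__(self, state_dim: int, action_dim: int, hidden=(32, 32)):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.Tanh(),
            nn.Linear(hidden[0], hidden[1]), nn.Tanh(),
            nn.Linear(hidden[1], action_dim),
        )
        # State-independent log standard deviation (a common choice; assumption).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        mean = self.body(state)
        return Normal(mean, self.log_std.exp())

# Example: the Circle task uses a point-mass with S ⊆ R^9, A ⊆ R^2 and (32, 32) layers.
policy = GaussianPolicy(state_dim=9, action_dim=2, hidden=(32, 32))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)  # step size δ = 1e-4

# Reported training scale: 5 independent runs of K = 500 episodes, horizon H = 200.
K, H = 500, 200
```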