e-COP : Episodic Constrained Optimization of Policies
Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Sahil Singla
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. |
| Researcher Affiliation | Collaboration | Akhil Agnihotri University of Southern California agnihotri.akhil@gmail.com Rahul Jain Google DeepMind and USC rahulajain@google.com Deepak Ramachandran Google DeepMind ramachandrand@google.com Sahil Singla Google DeepMind sasingla@google.com |
| Pseudocode | Yes | Algorithm 1 Iterative Policy Optimization for Constrained Episodic (IPOCE) RL |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have provided the code. |
| Open Datasets | Yes | For a comprehensive empirical evaluation, we selected eight scenarios from well-known safe RL benchmark environments Safe MuJoCo [43] and Safety Gym [30], as well as MuJoCo environments. |
| Dataset Splits | No | The paper does not explicitly specify traditional train/validation/test dataset splits, as data in RL is generated dynamically. It specifies episode count and horizon for training and evaluation. |
| Hardware Specification | Yes | All experiments were implemented in PyTorch 1.7.0 with CUDA 11.0 and conducted on an Ubuntu 20.04.2 LTS machine with 8 CPU cores (AMD Ryzen Threadripper PRO 3975WX 8-Cores), 127 GB memory and 2 GPU cards (NVIDIA GeForce RTX 4060 Ti). |
| Software Dependencies | Yes | All experiments were implemented in PyTorch 1.7.0 with CUDA 11.0 |
| Experiment Setup | Yes | For the Circle task, we use a point-mass with S ⊆ R^9, A ⊆ R^2, and for the Reach task, an ant robot with S ⊆ R^16, A ⊆ R^8. The Grid task has S ⊆ R^56, A ⊆ R^4. We use two-hidden-layer neural networks to represent Gaussian policies for the tasks. For Circle and Reach, the size is (32,32) for both layers, and for Grid and Navigation the layer sizes are (16,16) and (25,25). We set the step size δ to 10^-4, and for each task, we conduct 5 independent runs of K = 500 episodes, each of horizon H = 200. (A minimal sketch of this setup is given below.) |
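
The Experiment Setup row above describes two-hidden-layer Gaussian policy networks trained in PyTorch for K = 500 episodes of horizon H = 200 with a step size of 10^-4. The following is a minimal sketch of such a setup under those reported values; it is not the authors' code, and the class name, the use of a state-independent log standard deviation, and the use of the Adam optimizer are assumptions for illustration.

```python
# Minimal sketch (not the authors' implementation) of the policy setup quoted
# in the Experiment Setup row. Layer sizes, state/action dimensions, step size
# (1e-4), K = 500, and H = 200 come from the paper excerpt; everything else
# (class names, state-independent log-std, Adam optimizer) is assumed.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Two-hidden-layer Gaussian policy, as described for the benchmark tasks."""

    def __init__(self, state_dim: int, action_dim: int, hidden=(32, 32)):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.Tanh(),
            nn.Linear(hidden[0], hidden[1]), nn.Tanh(),
            nn.Linear(hidden[1], action_dim),
        )
        # State-independent log standard deviation (a common choice; assumption).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        mean = self.body(state)
        return Normal(mean, self.log_std.exp())

# Example: the Circle task uses a point-mass with S ⊆ R^9, A ⊆ R^2 and (32, 32) layers.
policy = GaussianPolicy(state_dim=9, action_dim=2, hidden=(32, 32))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)  # step size δ = 1e-4

# Reported training scale: 5 independent runs of K = 500 episodes, horizon H = 200.
K, H = 500, 200
```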