Constrained Policy Optimization
Authors: Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety. In our experiments, we show that CPO can train neural network policies with thousands of parameters on high-dimensional simulated robot locomotion tasks to maximize rewards while successfully enforcing constraints. |
| Researcher Affiliation | Collaboration | 1 UC Berkeley, 2 OpenAI. |
| Pseudocode | Yes | Algorithm 1 Constrained Policy Optimization (an illustrative sketch of the update step this algorithm performs appears after the table) |
| Open Source Code | Yes | We give the pseudocode for our algorithm (for the single-constraint case) as Algorithm 1, and have made our code implementation available online: https://github.com/jachiam/cpo |
| Open Datasets | No | The paper evaluates on 'simulated robot locomotion tasks' in environments such as 'Point-Circle', 'Ant-Circle', 'Humanoid-Circle', 'Point-Gather', and 'Ant-Gather', which are described as custom simulation environments rather than standard public datasets with access information. |
| Dataset Splits | No | The paper does not provide dataset split information (e.g., percentages, sample counts, or references to predefined splits) for training, validation, or testing. |
| Hardware Specification | No | The paper does not describe the hardware (e.g., GPU/CPU models, memory) used to run its experiments. |
| Software Dependencies | No | Our experiments are implemented in rllab (Duan et al., 2016). This names a framework but provides no version numbers for rllab or the other libraries needed for replication. |
| Experiment Setup | No | For all experiments, we use neural network policies with two hidden layers of size (64, 32). This specifies the architecture (a sketch follows the table), but the paper does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or a comprehensive system-level training configuration. |
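
The pseudocode row above refers to Algorithm 1, whose core step solves a small local subproblem: maximize a linearized reward surrogate subject to a linearized cost constraint and a quadratic KL trust region. The sketch below solves that subproblem with an off-the-shelf SLSQP solver on toy values; the parameter dimension, random gradients, and solver choice are illustrative assumptions, not details from the paper, and the authors' released implementation at https://github.com/jachiam/cpo is the authoritative version.

```python
# Minimal, illustrative sketch of the CPO update subproblem (Algorithm 1).
# All numerical values below are toy assumptions for demonstration only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 5                        # toy parameter dimension (assumption)
g = rng.normal(size=n)       # gradient of the reward surrogate objective
b = rng.normal(size=n)       # gradient of the cost (constraint) surrogate
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)  # stand-in for the Fisher information matrix
c = -0.05                    # constraint slack J_C(pi_k) - d (negative = feasible)
delta = 0.01                 # KL trust-region radius

# Local approximation to CPO's update, over the step x = theta_{k+1} - theta_k:
#   maximize   g^T x
#   subject to c + b^T x <= 0            (linearized cost constraint)
#              0.5 * x^T H x <= delta    (quadratic KL trust region)
res = minimize(
    lambda x: -g @ x,
    x0=np.zeros(n),          # x = 0 is feasible here since c < 0
    method="SLSQP",
    constraints=[
        {"type": "ineq", "fun": lambda x: -(c + b @ x)},
        {"type": "ineq", "fun": lambda x: delta - 0.5 * x @ H @ x},
    ],
)
step = res.x
print("surrogate improvement:", g @ step)
print("KL used:", 0.5 * step @ H @ step, "<= delta:", delta)
print("constraint value:", c + b @ step, "<= 0")
```

The generic solver keeps the sketch short; the paper instead exploits the structure of this problem (one linear constraint plus one quadratic constraint) to solve it through its dual, which is what makes the update cheap at neural-network scale.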
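
The experiment-setup row quotes the one architecture detail the paper does give: two hidden layers of sizes (64, 32). Below is a minimal sketch of such a policy. The original code is built on rllab, so the use of PyTorch, tanh activations, and a state-independent Gaussian action head are assumptions for illustration, not details from the paper.

```python
# Sketch of a policy network with two hidden layers of sizes (64, 32),
# as stated in the paper. Activations and the Gaussian head are assumptions.
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),  # hidden layer 1 (size 64)
            nn.Linear(64, 32), nn.Tanh(),       # hidden layer 2 (size 32)
            nn.Linear(32, act_dim),             # mean of the action distribution
        )
        # State-independent log-std, a common choice for locomotion policies.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())

# Usage: sample an action for a toy 10-dim observation, 3-dim action space.
policy = GaussianMLPPolicy(obs_dim=10, act_dim=3)
action = policy(torch.zeros(10)).sample()
```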