Constrained Policy Optimization

Authors: Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety. In our experiments, we show that CPO can train neural network policies with thousands of parameters on high-dimensional simulated robot locomotion tasks to maximize rewards while successfully enforcing constraints.
Researcher Affiliation | Collaboration | UC Berkeley and OpenAI.
Pseudocode | Yes | Algorithm 1: Constrained Policy Optimization.
Open Source Code | Yes | We give the pseudocode for our algorithm (for the single-constraint case) as Algorithm 1, and have made our code implementation available online: https://github.com/jachiam/cpo
Open Datasets | No | The paper uses 'simulated robot locomotion tasks' in environments such as 'Point-Circle', 'Ant-Circle', 'Humanoid-Circle', 'Point-Gather', and 'Ant-Gather', which are custom simulation environments rather than standard public datasets with access information provided.
Dataset Splits | No | The paper does not provide dataset split information (e.g., percentages, sample counts, or references to predefined splits) for training, validation, or testing.
Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | Our experiments are implemented in rllab (Duan et al., 2016). This names a framework but gives no version numbers for it or for the other libraries needed for replication.
Experiment Setup | No | For all experiments, we use neural network policies with two hidden layers of size (64, 32). This is a model architecture detail, but the paper does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or comprehensive system-level training configurations. A hedged sketch of such a policy network follows this table.
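
The quoted setup fixes only the policy architecture: two hidden layers of size (64, 32). As an illustration of what such a policy could look like, below is a minimal sketch of a Gaussian MLP policy. The use of PyTorch, tanh activations, and a state-independent log-std parameter are assumptions made here for illustration; the authors' released code builds on rllab, and nothing beyond the layer sizes is taken from the paper.

# Minimal sketch of a Gaussian MLP policy with hidden layers of size (64, 32),
# matching the architecture stated in the paper. Framework choice (PyTorch),
# tanh activations, and the shared log-std head are assumptions, not details
# from the paper or its rllab-based implementation.
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, act_dim),
        )
        # Learned log standard deviation, shared across all states.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Example usage: sample an action for a hypothetical 10-dimensional observation
# and 3-dimensional action space.
policy = GaussianMLPPolicy(obs_dim=10, act_dim=3)
action = policy(torch.randn(10)).sample()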