Reward Constrained Policy Optimization
Authors: Chen Tessler, Daniel J. Mankowitz, Shie Mannor
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the RCPO algorithm in various domains: a grid-world, and 6 tasks in the Mujoco simulator (Todorov et al., 2012). We prove the convergence of our approach and provide empirical evidence of its ability to train constraint satisfying policies. |
| Researcher Affiliation | Collaboration | Chen Tessler (1), Daniel J. Mankowitz (2), and Shie Mannor (1); (1) Technion - Israel Institute of Technology, Haifa, Israel; (2) DeepMind, London, England |
| Pseudocode | Yes | Algorithm 1 Template for an RCPO implementation |
| Open Source Code | No | For these experiments we used a PyTorch (Paszke et al., 2017) implementation of PPO (Kostrikov, 2018). https://github.com/ikostrikov/pytorch-a2c-ppo-acktr. (This link is for a third-party implementation of PPO, not the authors' RCPO code.) |
| Open Datasets | No | The paper describes using the Mujoco simulator (Todorov et al., 2012) and OpenAI (2017) interfaces, but does not provide concrete access information (link, DOI, or formal citation with author/year for a dataset) for a publicly available or open dataset used for training. The Mars Rover domain is custom-built. |
| Dataset Splits | No | The paper does not provide specific dataset split information (percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number. It refers to 'PPO' as an algorithm but not a specific software package with a version number for replication. |
| Experiment Setup | Yes | For these experiments we used a PyTorch (Paszke et al., 2017) implementation of PPO (Kostrikov, 2018). Notice that as in each domain the state represents the location and velocity of each joint, the number of inputs differs between domains. The network is as follows: layer 1: actor Linear(input = x, output = 64), critic Linear(input = x, output = 64); layer 2: actor Linear(64, 64), critic Linear(64, 64); layer 3: actor DiagGaussian(input = 64, output = y), critic Linear(input = 64, output = 1); learning rates: actor 3e-4, critic 1.5e-4. DiagGaussian is a multivariate Gaussian distribution layer which learns a mean (as a function of the previous layer's output) and a std per motor, from which the torque is sampled. Between each layer, a Tanh non-linearity is applied. We report the online performance of the agent and run each test for a total of 1M samples. In these domains we start λ at 0 and use a learning rate lr_λ = 5e-7 which decays at a rate of κ = (1 - 1e-9) in order to avoid oscillations. The simulations were run using Generalized Advantage Estimation (Schulman et al., 2015b) with coefficient τ = 0.95 and discount factor γ = 0.99. The MDP was defined as follows: r_step = 0.01, r_goal = 0, δ = 0.05, γ = 0.99. ... we start λ at 0.6 and use a learning rate lr_λ = 0.000025. (Hedged sketches of this setup follow the table.) |
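
The architecture quoted in the setup cell maps to a small PyTorch actor-critic model. The sketch below is an illustrative reconstruction under the stated layer sizes, Tanh non-linearities, and learning rates, not the authors' released code; in particular, the state-independent per-motor log-std in the DiagGaussian head is an assumption borrowed from common PPO implementations such as the Kostrikov repository cited above.

```python
# Hedged reconstruction of the reported actor/critic networks (assumption:
# the DiagGaussian head learns a state-independent log-std per motor).
import torch
import torch.nn as nn


class DiagGaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        # Layers 1-2: Linear(x -> 64) -> Tanh -> Linear(64 -> 64) -> Tanh
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        # Layer 3: mean of the torque distribution per motor; the log-std is
        # a free parameter per action dimension (assumption, see lead-in).
        self.mean = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mean(self.body(obs))
        return torch.distributions.Normal(mu, self.log_std.exp())


class Critic(nn.Module):
    def __init__(self, obs_dim: int):
        super().__init__()
        # Layers 1-3: Linear(x -> 64) -> Tanh -> Linear(64 -> 64) -> Tanh -> Linear(64 -> 1)
        self.value = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.value(obs)


# Reported learning rates: 3e-4 (actor), 1.5e-4 (critic).
# actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1.5e-4)
```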
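
The reported λ schedules (λ starts at 0 with lr_λ = 5e-7 decayed by κ = 1 - 1e-9 in the MuJoCo tasks; 0.6 with lr_λ = 2.5e-5 in the other reported setting) correspond to the dual step of the RCPO template (Algorithm 1 in the paper): the policy is optimized on a penalized reward of the form r - λ·c while λ is updated on the constraint violation. The following is a minimal sketch under those assumptions; `constraint_value` and `constraint_threshold` are hypothetical placeholders for the estimated constraint and its allowed limit α.

```python
# Minimal sketch (not the authors' code) of an RCPO-style Lagrange-multiplier
# update with the hyperparameters reported above: lambda_0 = 0,
# lr_lambda = 5e-7, decayed by kappa = (1 - 1e-9) each step, with lambda
# projected back onto [0, inf).

def penalized_reward(reward: float, penalty: float, lmbda: float) -> float:
    """Reward fed to the policy-gradient update: r - lambda * c."""
    return reward - lmbda * penalty


def update_lambda(lmbda: float, lr_lambda: float,
                  constraint_value: float, constraint_threshold: float,
                  kappa: float = 1.0 - 1e-9):
    """One dual-ascent step on lambda (hypothetical helper, see lead-in)."""
    # Increase lambda when the estimated constraint exceeds its threshold,
    # decrease it otherwise, never letting it go negative.
    lmbda = max(0.0, lmbda + lr_lambda * (constraint_value - constraint_threshold))
    # Decay the multiplier's learning rate to damp oscillations.
    lr_lambda *= kappa
    return lmbda, lr_lambda


# Reported MuJoCo setting: lambda starts at 0 with lr_lambda = 5e-7.
lmbda, lr_lam = 0.0, 5e-7
# Example step (illustrative numbers only):
# lmbda, lr_lam = update_lambda(lmbda, lr_lam,
#                               constraint_value=0.07,
#                               constraint_threshold=0.05)
```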