Reward Constrained Policy Optimization
Authors: Chen Tessler, Daniel J. Mankowitz, Shie Mannor
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the RCPO algorithm in various domains: a grid-world, and 6 tasks in the Mujoco simulator (Todorov et al., 2012). We prove the convergence of our approach and provide empirical evidence of its ability to train constraint satisfying policies. |
| Researcher Affiliation | Collaboration | Chen Tessler (1), Daniel J. Mankowitz (2), and Shie Mannor (1); (1) Technion - Israel Institute of Technology, Haifa, Israel; (2) DeepMind, London, England |
| Pseudocode | Yes | Algorithm 1 Template for an RCPO implementation |
| Open Source Code | No | For these experiments we used a PyTorch (Paszke et al., 2017) implementation of PPO (Kostrikov, 2018). https://github.com/ikostrikov/pytorch-a2c-ppo-acktr. (This link is for a third-party implementation of PPO, not the authors' RCPO code.) |
| Open Datasets | No | The paper describes using the Mujoco simulator (Todorov et al., 2012) and OpenAI (2017) interfaces, but does not provide concrete access information (link, DOI, or formal citation with author/year for a dataset) for a publicly available or open dataset used for training. The Mars Rover domain is custom-built. |
| Dataset Splits | No | The paper does not provide specific dataset split information (percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number. It refers to 'PPO' as an algorithm but not a specific software package with a version number for replication. |
| Experiment Setup | Yes | For these experiments we used a PyTorch (Paszke et al., 2017) implementation of PPO (Kostrikov, 2018). Notice that as in each domain the state represents the location and velocity of each joint, the number of inputs differs between domains. The network is as follows: layer 1: actor Linear(input = x, output = 64), critic Linear(input = x, output = 64); layer 2: actor Linear(64, 64), critic Linear(64, 64); layer 3: actor DiagGaussian(input = 64, output = y), critic Linear(input = 64, output = 1); learning rates: actor 3e-4, critic 1.5e-4. DiagGaussian is a multivariate Gaussian distribution layer which learns a mean (as a function of the previous layer's output) and a std per motor, from which the torque is sampled. Between each layer, a Tanh non-linearity is applied. We report the online performance of the agent and run each test for a total of 1M samples. In these domains we start λ at 0 and use a learning rate lr_λ = 5e-7 which decays at a rate of κ = (1 - 1e-9) in order to avoid oscillations. The simulations were run using Generalized Advantage Estimation (Schulman et al., 2015b) with coefficient τ = 0.95 and discount factor γ = 0.99. The MDP was defined as follows: r_step = 0.01, r_goal = 0, δ = 0.05, γ = 0.99. ... we start λ at 0.6 and use a learning rate lr_λ = 0.000025. (Hedged sketches of this setup follow the table.) |
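
The architecture quoted in the setup cell maps to a small PyTorch actor-critic model. The sketch below is an illustrative reconstruction under the stated layer sizes, Tanh non-linearities, and learning rates, not the authors' released code; in particular, the state-independent per-motor log-std in the DiagGaussian head is an assumption borrowed from common PPO implementations such as the Kostrikov repository cited above.

```python
# Hedged reconstruction of the reported actor/critic networks (assumption:
# the DiagGaussian head learns a state-independent log-std per motor).
import torch
import torch.nn as nn


class DiagGaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        # Layers 1-2: Linear(x -> 64) -> Tanh -> Linear(64 -> 64) -> Tanh
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        # Layer 3: mean of the torque distribution per motor; the log-std is
        # a free parameter per action dimension (assumption, see lead-in).
        self.mean = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mean(self.body(obs))
        return torch.distributions.Normal(mu, self.log_std.exp())


class Critic(nn.Module):
    def __init__(self, obs_dim: int):
        super().__init__()
        # Layers 1-3: Linear(x -> 64) -> Tanh -> Linear(64 -> 64) -> Tanh -> Linear(64 -> 1)
        self.value = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.value(obs)


# Reported learning rates: 3e-4 (actor), 1.5e-4 (critic).
# actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1.5e-4)
```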
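
The reported λ schedules (λ starts at 0 with lr_λ = 5e-7 decayed by κ = 1 - 1e-9 in the MuJoCo tasks; 0.6 with lr_λ = 2.5e-5 in the other reported setting) correspond to the dual step of the RCPO template (Algorithm 1 in the paper): the policy is optimized on a penalized reward of the form r - λ·c while λ is updated on the constraint violation. The following is a minimal sketch under those assumptions; `constraint_value` and `constraint_threshold` are hypothetical placeholders for the estimated constraint and its allowed limit α.

```python
# Minimal sketch (not the authors' code) of an RCPO-style Lagrange-multiplier
# update with the hyperparameters reported above: lambda_0 = 0,
# lr_lambda = 5e-7, decayed by kappa = (1 - 1e-9) each step, with lambda
# projected back onto [0, inf).

def penalized_reward(reward: float, penalty: float, lmbda: float) -> float:
    """Reward fed to the policy-gradient update: r - lambda * c."""
    return reward - lmbda * penalty


def update_lambda(lmbda: float, lr_lambda: float,
                  constraint_value: float, constraint_threshold: float,
                  kappa: float = 1.0 - 1e-9):
    """One dual-ascent step on lambda (hypothetical helper, see lead-in)."""
    # Increase lambda when the estimated constraint exceeds its threshold,
    # decrease it otherwise, never letting it go negative.
    lmbda = max(0.0, lmbda + lr_lambda * (constraint_value - constraint_threshold))
    # Decay the multiplier's learning rate to damp oscillations.
    lr_lambda *= kappa
    return lmbda, lr_lambda


# Reported MuJoCo setting: lambda starts at 0 with lr_lambda = 5e-7.
lmbda, lr_lam = 0.0, 5e-7
# Example step (illustrative numbers only):
# lmbda, lr_lam = update_lambda(lmbda, lr_lam,
#                               constraint_value=0.07,
#                               constraint_threshold=0.05)
```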