Policy Optimization by Genetic Distillation

Authors: Tanmay Gangwani, Jian Peng

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on MuJoCo tasks show that GPO as a genetic algorithm is able to provide superior performance over the state-of-the-art policy gradient methods and achieves comparable or higher sample efficiency.
Researcher Affiliation | Academia | Tanmay Gangwani, Computer Science, UIUC, Urbana, IL 61801, gangwan2@illinois.edu; Jian Peng, Computer Science, UIUC, Urbana, IL 61801, jianpeng@illinois.edu
Pseudocode | Yes | Algorithm 1 Genetic Policy Optimization (a schematic sketch appears after this table)
Open Source Code | No | The paper mentions using the OpenAI rllab framework but does not provide a link or explicit statement about the availability of their own source code.
Open Datasets | Yes | All our experiments are done using the OpenAI rllab framework (Duan et al., 2016). We benchmark 9 continuous-control locomotion tasks based on the MuJoCo physics simulator. The Hilly variants are more difficult versions of the original environments (https://github.com/rll/rllab/pull/121).
Dataset Splits | No | The paper describes training procedures and data collection, but does not provide specific train/validation/test dataset splits with percentages or counts.
Hardware Specification | Yes | All runs use the same number of simulation timesteps, and are done on an Intel Xeon machine with 12 cores (Intel CPU E5-2620 v3 @ 2.40GHz).
Software Dependencies | No | The paper mentions using the OpenAI rllab framework and the Adam optimizer, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | All our control policies are Gaussian, with the mean parameterized by a neural network of two hidden layers (64 hidden units each), and linear units for the final output layer. The diagonal covariance matrix is learnt as a parameter, independent of the input observation, similar to (Schulman et al., 2015; 2017). The binary policy (πS) used for crossover has two hidden layers (32 hidden units each), followed by a softmax. The value-function baseline used for advantage estimation also has two hidden layers (32 hidden units each). All neural networks use tanh as the non-linearity at the hidden units. Horizon (T) = 512, Discount (γ) = 0.99, PPO epochs = 10, PPO/A2C batch-size (GPO, Single) = 2048, PPO/A2C batch-size (Joint) = 16384. (See the policy sketches after this table.)
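
The Pseudocode row above refers to the paper's Algorithm 1 (Genetic Policy Optimization). The following is a minimal, schematic Python sketch of one generation of such a loop, not the authors' algorithm verbatim: the callables `mutate`, `select_couples`, and `crossover` are placeholders standing in for the paper's policy-gradient mutation, fitness-based couple selection, and distillation-based state-space crossover, and the exact ordering and bookkeeping within a generation are assumptions here.

```python
def gpo_generation(population, mutate, select_couples, crossover):
    """One schematic GPO-style generation (placeholder operators, not Algorithm 1 verbatim).

    mutate(pi)          -> policy improved by policy-gradient updates (e.g. PPO/A2C)
    select_couples(pop) -> list of (parent_a, parent_b) pairs chosen by fitness
    crossover(pa, pb)   -> offspring policy distilled from the two parents
    """
    mutated = [mutate(pi) for pi in population]              # mutation step
    couples = select_couples(mutated)                        # selection step
    offspring = [crossover(pa, pb) for pa, pb in couples]    # crossover step
    return offspring                                         # next generation's population
```

How the population size is kept fixed across generations (i.e., how many couples are selected and how offspring replace parents) is elided in this sketch.
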
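The Experiment Setup row describes the control policy concretely: a Gaussian policy whose mean is a two-hidden-layer tanh MLP (64 units each) with a linear output layer, and a state-independent diagonal covariance learned as a free parameter. Below is an illustrative PyTorch reconstruction of that description, not the authors' code (their experiments use the OpenAI rllab framework); the class and argument names are placeholders.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Illustrative Gaussian policy: tanh MLP mean + learned state-independent log-std."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),           # linear output units for the mean
        )
        # Diagonal covariance is a learned parameter, independent of the observation.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

# Example usage: sample an action and compute its log-probability.
policy = GaussianMLPPolicy(obs_dim=17, act_dim=6)   # dimensions chosen arbitrarily for illustration
dist = policy(torch.randn(1, 17))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)            # sum over action dimensions
```
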
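The same row specifies the binary crossover policy πS as a two-hidden-layer (32 units each) tanh network followed by a softmax. The sketch below assumes, as a reading of the paper's state-space crossover, that πS picks which of the two parent policies acts in a given state; the `crossover_action` helper is hypothetical and only illustrates that use.

```python
import torch
import torch.nn as nn

class BinarySelectionPolicy(nn.Module):
    """Illustrative binary policy pi_S: two 32-unit tanh layers, 2-way softmax over parents."""

    def __init__(self, obs_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2),                 # logits over the two parent policies
        )

    def forward(self, obs):
        # Categorical over {parent A, parent B}; the softmax is applied via the logits.
        return torch.distributions.Categorical(logits=self.net(obs))

def crossover_action(obs, parent_a, parent_b, selector):
    """Hypothetical helper: let pi_S choose which parent policy acts in this state."""
    choice = selector(obs).sample()               # 0 -> parent A, 1 -> parent B
    dist = parent_a(obs) if choice.item() == 0 else parent_b(obs)
    return dist.sample()
```
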