Policy Optimization by Genetic Distillation

Authors: Tanmay Gangwani, Jian Peng

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on MuJoCo tasks show that GPO as a genetic algorithm is able to provide superior performance over the state-of-the-art policy gradient methods and achieves comparable or higher sample efficiency.
Researcher Affiliation | Academia | Tanmay Gangwani, Computer Science, UIUC, Urbana, IL 61801, gangwan2@illinois.edu; Jian Peng, Computer Science, UIUC, Urbana, IL 61801, jianpeng@illinois.edu
Pseudocode | Yes | Algorithm 1 Genetic Policy Optimization (a schematic sketch appears after this table)
Open Source Code | No | The paper mentions using the OpenAI rllab framework but does not provide a link or explicit statement about the availability of their own source code.
Open Datasets | Yes | All our experiments are done using the OpenAI rllab framework (Duan et al., 2016). We benchmark 9 continuous-control locomotion tasks based on the MuJoCo physics simulator. The Hilly variants are more difficult versions of the original environments (https://github.com/rll/rllab/pull/121).
Dataset Splits | No | The paper describes training procedures and data collection, but does not provide specific train/validation/test dataset splits with percentages or counts.
Hardware Specification | Yes | All runs use the same number of simulation timesteps, and are done on an Intel Xeon machine with 12 cores (Intel CPU E5-2620 v3 @ 2.40GHz).
Software Dependencies | No | The paper mentions using the OpenAI rllab framework and the Adam optimizer, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | All our control policies are Gaussian, with the mean parameterized by a neural network of two hidden layers (64 hidden units each), and linear units for the final output layer. The diagonal covariance matrix is learnt as a parameter, independent of the input observation, similar to (Schulman et al., 2015; 2017). The binary policy (πS) used for crossover has two hidden layers (32 hidden units each), followed by a softmax. The value-function baseline used for advantage estimation also has two hidden layers (32 hidden units each). All neural networks use tanh as the non-linearity at the hidden units. Horizon (T) = 512, Discount (γ) = 0.99, PPO epochs = 10, PPO/A2C batch-size (GPO, Single) = 2048, PPO/A2C batch-size (Joint) = 16384. (See the policy sketches after this table.)
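
The Pseudocode row above refers to the paper's Algorithm 1 (Genetic Policy Optimization). The following is a minimal, schematic Python sketch of one generation of such a loop, not the authors' algorithm verbatim: the callables `mutate`, `select_couples`, and `crossover` are placeholders standing in for the paper's policy-gradient mutation, fitness-based couple selection, and distillation-based state-space crossover, and the exact ordering and bookkeeping within a generation are assumptions here.

```python
def gpo_generation(population, mutate, select_couples, crossover):
    """One schematic GPO-style generation (placeholder operators, not Algorithm 1 verbatim).

    mutate(pi)          -> policy improved by policy-gradient updates (e.g. PPO/A2C)
    select_couples(pop) -> list of (parent_a, parent_b) pairs chosen by fitness
    crossover(pa, pb)   -> offspring policy distilled from the two parents
    """
    mutated = [mutate(pi) for pi in population]              # mutation step
    couples = select_couples(mutated)                        # selection step
    offspring = [crossover(pa, pb) for pa, pb in couples]    # crossover step
    return offspring                                         # next generation's population
```

How the population size is kept fixed across generations (i.e., how many couples are selected and how offspring replace parents) is elided in this sketch.
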
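The Experiment Setup row describes the control policy concretely: a Gaussian policy whose mean is a two-hidden-layer tanh MLP (64 units each) with a linear output layer, and a state-independent diagonal covariance learned as a free parameter. Below is an illustrative PyTorch reconstruction of that description, not the authors' code (their experiments use the OpenAI rllab framework); the class and argument names are placeholders.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Illustrative Gaussian policy: tanh MLP mean + learned state-independent log-std."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),           # linear output units for the mean
        )
        # Diagonal covariance is a learned parameter, independent of the observation.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

# Example usage: sample an action and compute its log-probability.
policy = GaussianMLPPolicy(obs_dim=17, act_dim=6)   # dimensions chosen arbitrarily for illustration
dist = policy(torch.randn(1, 17))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)            # sum over action dimensions
```
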
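The same row specifies the binary crossover policy πS as a two-hidden-layer (32 units each) tanh network followed by a softmax. The sketch below assumes, as a reading of the paper's state-space crossover, that πS picks which of the two parent policies acts in a given state; the `crossover_action` helper is hypothetical and only illustrates that use.

```python
import torch
import torch.nn as nn

class BinarySelectionPolicy(nn.Module):
    """Illustrative binary policy pi_S: two 32-unit tanh layers, 2-way softmax over parents."""

    def __init__(self, obs_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2),                 # logits over the two parent policies
        )

    def forward(self, obs):
        # Categorical over {parent A, parent B}; the softmax is applied via the logits.
        return torch.distributions.Categorical(logits=self.net(obs))

def crossover_action(obs, parent_a, parent_b, selector):
    """Hypothetical helper: let pi_S choose which parent policy acts in this state."""
    choice = selector(obs).sample()               # 0 -> parent A, 1 -> parent B
    dist = parent_a(obs) if choice.item() == 0 else parent_b(obs)
    return dist.sample()
```
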