Policy Optimization by Genetic Distillation
Authors: Tanmay Gangwani, Jian Peng
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on MuJoCo tasks show that GPO as a genetic algorithm is able to provide superior performance over the state-of-the-art policy gradient methods and achieves comparable or higher sample efficiency. |
| Researcher Affiliation | Academia | Tanmay Gangwani, Computer Science, UIUC, Urbana, IL 61801, gangwan2@illinois.edu; Jian Peng, Computer Science, UIUC, Urbana, IL 61801, jianpeng@illinois.edu |
| Pseudocode | Yes | Algorithm 1 Genetic Policy Optimization |
| Open Source Code | No | The paper mentions using the OpenAI rllab framework but does not provide a link or explicit statement about the availability of their own source code. |
| Open Datasets | Yes | All our experiments are done using the OpenAI rllab framework (Duan et al., 2016). We benchmark 9 continuous-control locomotion tasks based on the MuJoCo physics simulator. The Hilly variants are more difficult versions of the original environments (https://github.com/rll/rllab/pull/121). |
| Dataset Splits | No | The paper describes training procedures and data collection, but does not provide specific train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | Yes | All runs use the same number of simulation timesteps, and are done on an Intel Xeon machine with 12 cores (Intel CPU E5-2620 v3 @ 2.40GHz). |
| Software Dependencies | No | The paper mentions using the OpenAI rllab framework and the Adam optimizer, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | All our control policies are Gaussian, with the mean parameterized by a neural network of two hidden layers (64 hidden units each), and linear units for the final output layer. The diagonal co-variance matrix is learnt as a parameter, independent of the input observation, similar to (Schulman et al., 2015; 2017). The binary policy (πS) used for crossover has two hidden layers (32 hidden units each), followed by a softmax. The value-function baseline used for advantage estimation also has two hidden layers (32 hidden units each). All neural networks use tanh as the non-linearity at the hidden units. Horizon (T) = 512, Discount (γ) = 0.99, PPO epochs = 10, PPO/A2C batch-size (GPO, Single) = 2048, PPO/A2C batch-size (Joint) = 16384 |
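
The quoted experiment setup is concrete enough to sketch the network architectures and hyperparameters directly. Below is a minimal sketch assuming PyTorch; the paper itself uses the OpenAI rllab framework, and the class and function names (`GaussianPolicy`, `CrossoverSelector`, `ValueBaseline`, `mlp`) are illustrative, not taken from the paper. Only the layer sizes, tanh nonlinearity, state-independent diagonal covariance, and the listed hyperparameters come from the quoted text.

```python
# Illustrative sketch (assumed PyTorch, not the paper's rllab code) of the
# networks described in the "Experiment Setup" row above.
import torch
import torch.nn as nn
from torch.distributions import Normal, Categorical


def mlp(in_dim, hidden, out_dim):
    """Two tanh hidden layers followed by a linear output layer."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )


class GaussianPolicy(nn.Module):
    """Gaussian control policy: mean from a 2x64 MLP, state-independent log-std."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = mlp(obs_dim, 64, act_dim)
        # Diagonal covariance learnt as a free parameter, independent of the observation.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())


class CrossoverSelector(nn.Module):
    """Binary policy (pi_S) used for crossover: 2x32 MLP with a softmax output."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = mlp(obs_dim, 32, 2)

    def forward(self, obs):
        # Categorical with logits applies the softmax internally.
        return Categorical(logits=self.net(obs))


class ValueBaseline(nn.Module):
    """Value-function baseline for advantage estimation: 2x32 MLP, scalar output."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = mlp(obs_dim, 32, 1)

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


# Hyperparameters quoted in the table above.
HORIZON = 512              # T
DISCOUNT = 0.99            # gamma
PPO_EPOCHS = 10
BATCH_SIZE_SINGLE = 2048   # PPO/A2C batch size (GPO, Single)
BATCH_SIZE_JOINT = 16384   # PPO/A2C batch size (Joint)
```

The training loop itself (PPO/A2C mutation steps, crossover via the binary selector policy, and selection over the population in Algorithm 1) is not specified in the table beyond the batch sizes and epoch count, so it is deliberately left out of the sketch.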