Understanding the Impact of Entropy on Policy Optimization
Authors: Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, Dale Schuurmans
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show experimentally that the difficulty of policy optimization is strongly linked to the geometry of the objective function. ... We show experimentally that policies with higher entropy induce a smoother objective that connects solutions and enable the use of larger learning rates. ... We conduct experiments in a setting where the optimization procedure has access to the exact gradient. ... Continuous control tasks from the MuJoCo simulator (Todorov et al., 2012; Brockman et al., 2016) facilitate studying the impact of entropy because we can parameterize policies using Gaussian distributions. |
| Researcher Affiliation | Collaboration | ¹Mila, McGill University, Montréal, Canada; ²Work done while at Google Research; ³Google Research; ⁴University of Alberta. Correspondence to: Zafarali Ahmed <zafarali.ahmed@mail.mcgill.ca>. |
| Pseudocode | No | No pseudocode or algorithm blocks found. |
| Open Source Code | No | No explicit statement or link providing access to source code for the methodology described. |
| Open Datasets | Yes | We chose a 5×5 Gridworld with one suboptimal and one optimal reward at the corners (Figure 3). ... Continuous control tasks from the MuJoCo simulator (Todorov et al., 2012; Brockman et al., 2016) facilitate studying the impact of entropy because we can parameterize policies using Gaussian distributions. |
| Dataset Splits | No | No specific train/validation/test dataset splits (percentages or counts) are provided. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are provided. |
| Software Dependencies | No | The paper mentions the MuJoCo simulator but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | No | In Hopper and Walker, the best learning rate increases consistently with entropy: The learning rate for σ = 1 is 10 times larger than for σ = 0.1. We use a large batch size to control for the variance reduction effects of a larger σ (Zhao et al., 2011). While some learning rates appear in the Figure 5 legend, explicit numerical values or ranges for the full hyperparameter set (e.g., the exact batch size, initial learning rates for every experiment) are not formally stated in the text. |
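
Since no source code is available (see the Open Source Code row), the quantity the continuous-control experiments vary can still be reproduced from first principles. The sketch below computes the differential entropy of a Gaussian policy with fixed standard deviation σ, using the standard closed form for a diagonal Gaussian; the action dimensionality (`action_dim=3`) is an illustrative assumption, not a value stated in the paper.

```python
import numpy as np

# Differential entropy of a d-dimensional Gaussian policy with fixed diagonal
# standard deviation sigma: H = (d / 2) * log(2 * pi * e * sigma^2).
def gaussian_policy_entropy(sigma, action_dim):
    return 0.5 * action_dim * np.log(2.0 * np.pi * np.e * sigma ** 2)

# Illustrative only: action_dim=3 is an assumption, not taken from the paper.
for sigma in (0.1, 0.5, 1.0):
    h = gaussian_policy_entropy(sigma, action_dim=3)
    print(f"sigma = {sigma}: entropy = {h:.3f} nats")
```

Entropy grows monotonically with σ, which is why varying σ is the paper's handle on policy entropy, and why its learning-rate comparison (10× larger for σ = 1 than for σ = 0.1) is framed in terms of entropy.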
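
The Open Datasets row quotes the paper's 5×5 Gridworld with one suboptimal and one optimal corner reward, studied with exact gradients. Below is a minimal sketch of that setting, assuming an illustrative layout: rewards of 0.5 and 1.0, a center start state, γ = 0.9, and a simplified per-state entropy bonus rather than the paper's exact entropy-regularized return; none of these specifics come from the paper. It evaluates the exact objective of a tabular softmax policy along a linear interpolation between two parameter vectors, loosely mirroring the paper's objective-landscape visualizations.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
REWARDS = {(0, 0): 0.5, (SIZE - 1, SIZE - 1): 1.0}  # suboptimal vs. optimal corner

def softmax(theta):
    z = np.exp(theta - theta.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def step(state, a):
    r, c = state
    dr, dc = ACTIONS[a]
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))

def exact_return(probs, start=(2, 2)):
    """Exact policy evaluation: V(start) under a fixed tabular softmax policy."""
    V = np.zeros((SIZE, SIZE))
    for _ in range(200):  # fixed-policy value iteration to convergence
        V_new = np.zeros_like(V)
        for r in range(SIZE):
            for c in range(SIZE):
                if (r, c) in REWARDS:  # terminal cell: episode ends, value 0
                    continue
                for a in range(4):
                    s2 = step((r, c), a)
                    rwd = REWARDS.get(s2, 0.0)
                    V_new[r, c] += probs[r, c, a] * (rwd + GAMMA * V[s2])
        V = V_new
    return V[start]

def objective(theta, tau):
    """Exact return plus a simplified entropy bonus (mean per-state entropy)."""
    probs = softmax(theta)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean()
    return exact_return(probs) + tau * entropy

# Evaluate the objective along the segment between two random parameter vectors,
# loosely mirroring the paper's linear-interpolation visualizations.
rng = np.random.default_rng(0)
theta_a = rng.normal(size=(SIZE, SIZE, 4))
theta_b = rng.normal(size=(SIZE, SIZE, 4))
for tau in (0.0, 0.1):
    vals = [objective((1 - t) * theta_a + t * theta_b, tau)
            for t in np.linspace(0, 1, 11)]
    print(f"tau={tau}:", np.round(vals, 3))
```

Because the objective is computed by exact policy evaluation rather than sampled rollouts, this toy setup shares the paper's "access to the exact gradient" regime in spirit; how the landscape changes with the entropy weight τ in the paper's own environments is shown in its Figures 3–5.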