Understanding the Impact of Entropy on Policy Optimization

Authors: Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, Dale Schuurmans

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that the difficulty of policy optimization is strongly linked to the geometry of the objective function. ... We show experimentally that policies with higher entropy induce a smoother objective that connects solutions and enable the use of larger learning rates. ... We conduct experiments in a setting where the optimization procedure has access to the exact gradient. ... Continuous control tasks from the MuJoCo simulator (Todorov et al., 2012; Brockman et al., 2016) facilitate studying the impact of entropy because we can parameterize policies using Gaussian distributions.
Researcher Affiliation | Collaboration | ¹Mila, McGill University, Montréal, Canada; ²Work done while at Google Research; ³Google Research; ⁴University of Alberta. Correspondence to: Zafarali Ahmed <zafarali.ahmed@mail.mcgill.ca>.
Pseudocode | No | No pseudocode or algorithm blocks found.
Open Source Code | No | No explicit statement or link providing access to source code for the methodology described.
Open Datasets | Yes | We chose a 5×5 Gridworld with one suboptimal and one optimal reward at the corners (Figure 3). ... Continuous control tasks from the MuJoCo simulator (Todorov et al., 2012; Brockman et al., 2016) facilitate studying the impact of entropy because we can parameterize policies using Gaussian distributions.
Dataset Splits | No | No specific train/validation/test dataset splits (percentages or counts) are provided.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are provided.
Software Dependencies | No | The paper mentions the 'MuJoCo simulator' but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | No | In Hopper and Walker, the best learning rate increases consistently with entropy: the learning rate for σ = 1 is 10 times larger than for σ = 0.1. We use a large batch size to control for the variance reduction effects of a larger σ (Zhao et al., 2011). While learning rates appear in the Figure 5 legend, explicit numerical values or ranges for all hyperparameters (e.g., the exact batch size and the initial learning rates for all experiments) are not formally stated in the text.
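
To make the σ-entropy relationship quoted in the Research Type and Experiment Setup rows above concrete: for a Gaussian policy, the differential entropy depends only on σ, not on the mean, which is why fixing σ lets the paper control policy entropy in the MuJoCo tasks. A minimal sketch (ours, not code from the paper):

```python
import numpy as np

def gaussian_policy_entropy(sigma, action_dim=1):
    # Differential entropy of a Gaussian policy N(mu, sigma^2 * I):
    #   H = 0.5 * action_dim * log(2 * pi * e * sigma^2)
    # Note it is independent of the mean mu.
    return 0.5 * action_dim * np.log(2.0 * np.pi * np.e * sigma**2)

# sigma = 1 carries about 2.3 nats more entropy per action dimension than
# sigma = 0.1, the two settings contrasted in the Experiment Setup row.
print(gaussian_policy_entropy(0.1), gaussian_policy_entropy(1.0))
```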
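
The Open Datasets row notes a 5×5 Gridworld studied with access to the exact gradient. The sketch below shows one standard way an entropy-regularized objective can be evaluated exactly for a tabular policy by solving a linear system, so no sampling (and hence no gradient noise) is involved; the random MDP, the regularization weight tau, and the softmax parameterization are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def entropy_regularized_value(P, R, pi, gamma=0.99, tau=0.1):
    """Exact state values of a tabular policy under an entropy-regularized objective.

    P:  (S, A, S) transition probabilities
    R:  (S, A) rewards
    pi: (S, A) policy, rows sum to 1

    Solves V = r_pi + tau * H_pi + gamma * P_pi @ V as a linear system.
    """
    S = R.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)            # state-to-state transitions under pi
    r_pi = np.sum(pi * R, axis=1)                    # expected reward per state
    H_pi = -np.sum(pi * np.log(pi + 1e-12), axis=1)  # per-state policy entropy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi + tau * H_pi)

# Tiny random MDP with a softmax policy; tau > 0 adds the entropy bonus
# that the paper argues smooths the optimization landscape.
rng = np.random.default_rng(0)
S, A = 25, 4                                         # e.g. a 5x5 grid with 4 moves
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
logits = rng.normal(size=(S, A))
pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(entropy_regularized_value(P, R, pi, tau=0.1)[:3])
```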