Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Understanding the Impact of Entropy on Policy Optimization
Authors: Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, Dale Schuurmans
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show experimentally that the difficulty of policy optimization is strongly linked to the geometry of the objective function. ... We show experimentally that policies with higher entropy induce a smoother objective that connects solutions and enable the use of larger learning rates. ... We conduct experiments in a setting where the optimization procedure has access to the exact gradient. ... Continuous control tasks from the MuJoCo simulator (Todorov et al., 2012; Brockman et al., 2016) facilitate studying the impact of entropy because we can parameterize policies using Gaussian distributions. |
| Researcher Affiliation | Collaboration | ¹Mila, McGill University, Montréal, Canada ²Work done while at Google Research ³Google Research ⁴University of Alberta. Correspondence to: Zafarali Ahmed <EMAIL>. |
| Pseudocode | No | No pseudocode or algorithm blocks found. |
| Open Source Code | No | No explicit statement or link providing access to source code for the methodology described. |
| Open Datasets | Yes | We chose a 5 × 5 Gridworld with one suboptimal and one optimal reward at the corners (Figure 3). ... Continuous control tasks from the MuJoCo simulator (Todorov et al., 2012; Brockman et al., 2016) facilitate studying the impact of entropy because we can parameterize policies using Gaussian distributions. |
| Dataset Splits | No | No specific train/validation/test dataset splits (percentages or counts) are provided. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are provided. |
| Software Dependencies | No | The paper mentions 'MuJoCo simulator' but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | No | In Hopper and Walker, the best learning rate increases consistently with entropy: The learning rate for σ = 1 is 10 times larger than for σ = 0.1. We use a large batch size to control for the variance reduction effects of a larger σ (Zhao et al., 2011). While learning rates are shown in the Figure 5 legend, explicit numerical values or ranges for all hyperparameters (e.g., exact batch size, initial learning rates for all experiments) are not stated in the text. |