Mollification Effects of Policy Gradient Methods

Authors: Tao Wang, Sylvia Lee Herbert, Sicun Gao

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also provide experimental results to illustrate both the positive and negative aspects of mollification effects in practice. Equipped with the theoretical results, we conduct experiments to illustrate how our framework can be applied to explain both the successes and failures in practice. In particular, from the view of mollification, we can characterize a class of control problems where RL algorithms consistently face challenges: the region of attraction for the optimal policy is extremely small and thus can be entirely eliminated by the Gaussian kernel in stochastic policies. It also explains why policy gradient methods always encounter difficulties when solving quadrotor-related problems, and a detailed discussion is presented in Section 6. [See the first sketch below the table.]
Researcher Affiliation | Academia | University of California, San Diego, La Jolla, USA.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link to open-source code for the methodology described.
Open Datasets | Yes | We begin with a standard example in the OpenAI Gym documentation (Brockman et al., 2016). As shown in Figure 6(e), the policy landscape for a randomly initialized policy is fractal due to the chaotic nature of the underlying dynamics. [See the second sketch below the table.]
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits; it instead describes hyperparameters for training policies in simulation environments. For example, Table 1 lists 'Batch', 'Epoch', 'Horizon', and 'Discount factor'.
Hardware Specification | No | The paper does not provide any specific hardware details used for running its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | The hyperparameters used in Section 6 are summarized in Table 1. The controller is a 2-layer neural network u = W_2 tanh(W_1 s), where the hidden layer has width 16 and the activation function is tanh. The reward function is R(s, a) = -(s_2^2 + s_3^2 + s_4^2 + s_5^2 + 0.1(s_6^2 + s_7^2 + s_8^2 + s_9^2 + s_10^2 + s_11^2)) - 0.001 |a|^2, where the coordinates are specified as in Table 2. The stepsize for the policy update in each epoch is δ = 1. [See the third sketch below the table.]
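
The mollification claim quoted in the Research Type row is easy to visualize numerically. The sketch below is not taken from the paper: it builds a hypothetical 1-D return landscape with a broad local optimum and a very narrow global one, then convolves it with Gaussian kernels of increasing width, which is the smoothing a Gaussian stochastic policy applies to the objective. Once the kernel is wide enough, the narrow peak, i.e. the small region of attraction, disappears from the mollified landscape.

```python
# Minimal sketch (not from the paper): a 1-D illustration of how Gaussian
# policy noise mollifies the objective. A narrow spike in the return
# landscape can be smoothed away once the noise scale is large enough.
import numpy as np

theta = np.linspace(-3.0, 3.0, 2001)       # policy-parameter grid
dtheta = theta[1] - theta[0]

# Hypothetical deterministic return: a broad bump at theta = -1 plus a much
# higher but extremely narrow spike at theta = +1.
J = np.exp(-(theta + 1.0) ** 2) + 50.0 * np.exp(-((theta - 1.0) / 0.01) ** 2)

def mollify(J, sigma):
    """Convolve the landscape with a Gaussian kernel of std sigma, mimicking
    the expected return under Gaussian parameter perturbations."""
    half = int(4 * sigma / dtheta)
    x = np.arange(-half, half + 1) * dtheta
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(J, kernel, mode="same")

for sigma in (0.05, 0.2, 0.5):
    smoothed = mollify(J, sigma)
    print(f"sigma={sigma:.2f}: argmax {theta[np.argmax(J)]:+.2f} -> "
          f"{theta[np.argmax(smoothed)]:+.2f}")
# With enough smoothing the global maximizer jumps from the narrow spike at
# +1 to the broad bump at -1: the small region of attraction is erased.
```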
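
The fractal-landscape observation quoted in the Open Datasets row can be probed with a simple parameter sweep. The sketch below assumes the gymnasium package and uses Pendulum-v1 purely as a stand-in, since the quoted passage does not name the exact Gym example; it evaluates the deterministic return of a randomly initialized linear policy along a fine grid of one parameter, which is how a 1-D slice of a policy landscape such as Figure 6(e) can be drawn.

```python
# Minimal sketch (assumptions: gymnasium is installed; Pendulum-v1 stands in
# for the Gym example, which the quoted text does not specify). Sweep one
# policy parameter on a fine grid and record deterministic returns; a rough
# or fractal slice shows up as returns that keep oscillating as the grid is
# refined.
import numpy as np
import gymnasium as gym

env = gym.make("Pendulum-v1")
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=env.observation_space.shape[0])  # random linear policy

def rollout(w0, horizon=200, seed=0):
    """Deterministic return with the first policy weight replaced by w0."""
    w = W.copy()
    w[0] = w0
    obs, _ = env.reset(seed=seed)          # same seed -> same initial state
    total = 0.0
    for _ in range(horizon):
        action = np.clip([w @ obs], env.action_space.low, env.action_space.high)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total

grid = np.linspace(-1.0, 1.0, 201)
returns = [rollout(w0) for w0 in grid]
print(min(returns), max(returns))          # plot `returns` vs `grid` to see the slice
```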
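
The Experiment Setup row specifies the controller architecture and reward closely enough to transcribe. In the sketch below, the hidden width, tanh activation, reward coefficients, and stepsize δ = 1 follow the quoted text; the state and action dimensions and the 0-based index mapping of the coordinates s_2..s_11 are assumptions, since the paper's Table 2 is not reproduced on this page.

```python
# Minimal sketch of the controller and reward quoted in the Experiment Setup
# row. Hidden width 16, tanh activation, the reward coefficients, and
# delta = 1 come from the quoted text; the 12-dim state, 4-dim action, and
# the coordinate-to-index mapping are assumptions.
import numpy as np

STATE_DIM, HIDDEN_DIM, ACTION_DIM = 12, 16, 4   # 12/4 assumed, width 16 as stated
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(HIDDEN_DIM, STATE_DIM))
W2 = rng.normal(scale=0.1, size=(ACTION_DIM, HIDDEN_DIM))

def policy(s):
    """Two-layer controller u = W2 tanh(W1 s) with a width-16 hidden layer."""
    return W2 @ np.tanh(W1 @ s)

def reward(s, a):
    """R(s,a) = -(s_2^2+s_3^2+s_4^2+s_5^2 + 0.1*(s_6^2+...+s_11^2)) - 0.001*|a|^2.
    Mapping the paper's coordinates s_2..s_11 to Python indices 1..10 is an
    assumption."""
    return -(np.sum(s[1:5] ** 2) + 0.1 * np.sum(s[5:11] ** 2)) - 0.001 * np.sum(a ** 2)

# One policy-gradient-style parameter update with stepsize delta = 1; the
# gradient would come from rollouts, so a random direction stands in here.
delta = 1.0
grad_W1 = rng.normal(size=W1.shape)   # placeholder for an estimated gradient
W1 = W1 + delta * grad_W1
```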