Mollification Effects of Policy Gradient Methods
Authors: Tao Wang, Sylvia Lee Herbert, Sicun Gao
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also provide experimental results to illustrate both the positive and negative aspects of mollification effects in practice, showing how the theoretical framework explains both the successes and failures of policy gradient methods. In particular, from the view of mollification, we can characterize a class of control problems where RL algorithms consistently face challenges: the region of attraction for the optimal policy is extremely small and can therefore be entirely eliminated by the Gaussian kernel in stochastic policies (a toy numerical sketch of this smoothing-out effect follows the table). This also explains why policy gradient methods consistently encounter difficulties when solving quadrotor-related problems; a detailed discussion is presented in Section 6. |
| Researcher Affiliation | Academia | 1University of California, San Diego, La Jolla, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link to open-source code for the methodology described. |
| Open Datasets | Yes | We begin with a standard example from the OpenAI Gym documentation (Brockman et al., 2016). As shown in Figure 6(e), the policy landscape for a randomly initialized policy is fractal due to the chaotic nature of the underlying dynamics. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, but rather describes hyperparameters for training policies in simulation environments. For example, Table 1 lists 'Batch', 'Epoch', 'Horizon', 'Discount factor'. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | The hyperparameters used in Section 6 are summarized in Table 1. For the controller, we use a 2-layer neural network $u = W_2 \tanh(W_1 s)$ where the width of the hidden layer is 16 and the activation function is tanh. The reward function is $R(s, a) = -\left(s_2^2 + s_3^2 + s_4^2 + s_5^2 + 0.1\,(s_6^2 + s_7^2 + s_8^2 + s_9^2 + s_{10}^2 + s_{11}^2)\right) - 0.001\,\lVert a \rVert^2$, where the coordinates are specified as in Table 2. The step size for the policy update in each epoch is $\delta = 1$. A minimal code sketch of this controller and reward appears after the table. |
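
To make the smoothing-out effect mentioned in the Research Type row concrete, the following is a minimal numerical sketch (not the paper's code; all function names and constants are illustrative assumptions): a toy 1-D objective with a very narrow optimum next to a broad one is averaged under Gaussian perturbations, mimicking the mollification that a Gaussian stochastic policy induces on the objective.

```python
import numpy as np

# Toy 1-D "policy landscape": a narrow spike (tiny region of attraction)
# next to a broad bump. All constants here are illustrative assumptions,
# not values from the paper.
def landscape(theta):
    narrow = 2.0 * np.exp(-(theta - 2.0) ** 2 / (2 * 0.1 ** 2))  # sharp optimum
    broad = 1.0 * np.exp(-(theta + 1.0) ** 2 / (2 * 1.0 ** 2))   # flat local optimum
    return narrow + broad

def mollified(thetas, sigma, n_samples=20_000, seed=0):
    """Monte Carlo estimate of the Gaussian-smoothed objective
    E_{eps ~ N(0, sigma^2)}[J(theta + eps)] for each theta."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n_samples)
    return np.array([landscape(t + eps).mean() for t in thetas])

thetas = np.linspace(-4.0, 4.0, 401)
for sigma in (0.05, 0.5):
    smoothed = mollified(thetas, sigma)
    print(f"sigma={sigma}: smoothed objective is maximized near theta = "
          f"{thetas[np.argmax(smoothed)]:.2f}")
# With the small sigma the narrow optimum near theta = 2 survives; with the
# larger sigma the Gaussian kernel averages it away and the broad bump at
# theta = -1 becomes the global maximum of the smoothed objective.
```

The same mechanism, in high dimensions, is what the row above describes: a region of attraction small enough to be erased entirely by the policy's Gaussian kernel.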
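
Similarly, here is a minimal sketch of the controller and reward described in the Experiment Setup row. Only the network shape $u = W_2 \tanh(W_1 s)$, the hidden width of 16, and the reward formula come from that row; the state dimension, action dimension, and weight initialization are assumptions, since Table 2 of the paper is not reproduced here.

```python
import numpy as np

# Sketch of the controller and reward from the Experiment Setup row.
# ASSUMPTIONS (not from the paper's tables): 12-D state, 4-D action,
# and the random weight initialization below.
STATE_DIM, HIDDEN_DIM, ACTION_DIM = 12, 16, 4

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, size=(HIDDEN_DIM, STATE_DIM))
W2 = rng.normal(0.0, 0.1, size=(ACTION_DIM, HIDDEN_DIM))

def controller(s):
    """2-layer network u = W2 tanh(W1 s) with hidden width 16."""
    return W2 @ np.tanh(W1 @ s)

def reward(s, a):
    """R(s,a) = -(s_2^2+...+s_5^2 + 0.1*(s_6^2+...+s_11^2)) - 0.001*||a||^2.
    The paper's coordinates are 1-indexed; numpy slices are 0-indexed,
    so s[1:5] covers s_2..s_5 and s[5:11] covers s_6..s_11."""
    term_heavy = np.sum(s[1:5] ** 2)    # s_2^2 + ... + s_5^2
    term_light = np.sum(s[5:11] ** 2)   # s_6^2 + ... + s_11^2, weighted by 0.1
    return -(term_heavy + 0.1 * term_light) - 0.001 * np.sum(a ** 2)

# Usage: evaluate one state-action pair.
s = rng.normal(size=STATE_DIM)
a = controller(s)
print("action:", a)
print("reward:", reward(s, a))
```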