Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies
Authors: Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, Niao He
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present the results on the Humanoid and Reacher environments and defer the results on other tasks to Appendix A. For all methods, we sample 20 trajectories per iteration, and each trajectory has a maximum length H = 500. To ensure the same per-iteration cost for all methods, in (N)-HARPG we sample half of the trajectories to compute the stochastic gradient and the other half to estimate the Hessian-vector product. For a fair comparison of the different methods, we start all our runs from the same initial policy π_θ₀, where θ₀ is randomly initialized. |
| Researcher Affiliation | Academia | 1Department of Computer Science, ETH Zurich, Switzerland 2Department of Mathematics, ETH Zurich, Switzerland. Correspondence to: I.F. <ilyas.fn979@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: N-PG-IGT (Normalized-PG with Implicit Gradient Transport); Algorithm 2: (N)-HARPG ((Normalized) Hessian-Aided Recursive Policy Gradient); Algorithm 3: N-MPG (Normalized-Momentum Policy Gradient). A hedged sketch of one such normalized update step appears below the table. |
| Open Source Code | No | The paper states: "We implement the algorithms based on Vanilla-PG (REINFORCE) implementation in the garage library (garage contributors, 2019)" but does not provide an explicit statement or link for the open-sourcing of their own proposed methods (N-PG-IGT, HARPG, N-HARPG). |
| Open Datasets | Yes | We test the methods on the commonly used MuJoCo environments. In this section, we present the results on the Humanoid and Reacher environments and defer the results on other tasks to Appendix A. For all methods, we sample 20 trajectories per iteration, and each trajectory has a maximum length H = 500. |
| Dataset Splits | No | The paper describes training on MuJoCo environments but does not specify distinct training, validation, and testing dataset splits in terms of percentages or sample counts, as is common in reinforcement learning, where data is generated dynamically through environment interaction. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory, cloud instances) for running the experiments. It only refers to running them in "MuJoCo environments". |
| Software Dependencies | No | The paper mentions using the "garage library (garage contributors, 2019)", but it does not provide specific version numbers for this library or any other key software components, which is required for a reproducible description of the software dependencies. |
| Experiment Setup | Yes | For all methods, we sample 20 trajectories per iteration, and each trajectory has a maximum length H = 500. To ensure the same per-iteration cost for all methods, in (N)-HARPG we sample half of the trajectories to compute the stochastic gradient and the other half to estimate the Hessian-vector product. Table 3: Hyper-parameters and step-size choice. The initial step size is chosen from a set of 13 values based on the best performance at the last iteration. A hedged sketch of this per-iteration sampling split appears below the table. |
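
To make the Pseudocode row more concrete, the following is a minimal sketch of a normalized policy gradient step with implicit gradient transport, in the spirit of N-PG-IGT. It is not the authors' code: `estimate_policy_gradient` is a hypothetical stand-in for a REINFORCE-style estimator built from sampled trajectories, and the extrapolation rule and constants are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def estimate_policy_gradient(theta, env, num_trajectories=10, horizon=500):
    """Hypothetical stand-in for a REINFORCE-style gradient estimate at `theta`."""
    raise NotImplementedError

def n_pg_igt_step(theta, theta_prev, d_prev, env, eta=0.1, step_size=1e-2):
    # Implicit gradient transport: estimate the gradient at an extrapolated
    # point rather than at the current iterate (illustrative form).
    theta_tilde = theta + ((1.0 - eta) / eta) * (theta - theta_prev)
    g = estimate_policy_gradient(theta_tilde, env)

    # Exponential moving average of the transported gradient estimates.
    d = (1.0 - eta) * d_prev + eta * g

    # Normalized ascent step: fixed step size along d / ||d||.
    theta_next = theta + step_size * d / max(np.linalg.norm(d), 1e-12)
    return theta_next, d
```

The normalization of the update direction is what distinguishes the "N" variants named in the table from their unnormalized counterparts.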
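
The Experiment Setup row describes a fixed per-iteration sampling budget that is split in half for (N)-HARPG. Below is a hedged sketch of that split; `sample_trajectories`, `gradient_estimate`, and `hvp_estimate` are hypothetical placeholders (not garage API calls), and the recursive, Hessian-aided direction shown is only a schematic of what the name (N)-HARPG suggests, not the paper's exact estimator.

```python
import numpy as np

NUM_TRAJECTORIES = 20   # trajectories sampled per iteration (all methods)
HORIZON = 500           # maximum trajectory length H

def sample_trajectories(env, theta, n, horizon=HORIZON):
    """Placeholder: roll out `n` trajectories of length at most `horizon` under pi_theta."""
    raise NotImplementedError

def gradient_estimate(theta, trajectories):
    """Placeholder: REINFORCE-style stochastic gradient estimate."""
    raise NotImplementedError

def hvp_estimate(theta, direction, trajectories):
    """Placeholder: estimate of the Hessian-vector product Hess(J)(theta) @ direction."""
    raise NotImplementedError

def n_harpg_iteration(env, theta, theta_prev, d_prev, eta=0.1, step_size=1e-2):
    # Keep the per-iteration cost identical to the other methods:
    # 20 trajectories in total, half for the gradient, half for the HVP.
    trajs = sample_trajectories(env, theta, NUM_TRAJECTORIES)
    grad_half = trajs[: NUM_TRAJECTORIES // 2]
    hvp_half = trajs[NUM_TRAJECTORIES // 2 :]

    g = gradient_estimate(theta, grad_half)
    v = hvp_estimate(theta, theta - theta_prev, hvp_half)

    # Schematic recursive, Hessian-aided direction; details differ in the paper.
    d = eta * g + (1.0 - eta) * (d_prev + v)

    # The normalized variant ("N") steps along d / ||d||.
    theta_next = theta + step_size * d / max(np.linalg.norm(d), 1e-12)
    return theta_next, d
```

Splitting the 20-trajectory budget in half is what keeps the per-iteration sampling cost of (N)-HARPG equal to that of the methods that use all 20 trajectories for the gradient alone.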