Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies

Authors: Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, Niao He

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present the results on the Humanoid and Reacher environments and defer the results on other tasks to Appendix A. For all methods, at each iteration we sample 20 trajectories, and each trajectory has a maximum length H = 500. To ensure the same per-iteration cost for all methods, in (N)-HARPG we sample half of the trajectories to compute the stochastic gradient and the other half to estimate the Hessian-vector product. For a fair comparison of the different methods, we start all our runs from the same initial policy π_θ0, where θ0 is randomly initialized.
Researcher Affiliation | Academia | (1) Department of Computer Science, ETH Zurich, Switzerland; (2) Department of Mathematics, ETH Zurich, Switzerland. Correspondence to: I.F. <ilyas.fn979@gmail.com>.
Pseudocode | Yes | Algorithm 1: N-PG-IGT (Normalized-PG with Implicit Gradient Transport); Algorithm 2: (N)-HARPG ((Normalized) Hessian-Aided Recursive Policy Gradient); Algorithm 3: N-MPG (Normalized-Momentum Policy Gradient). (A generic sketch of the normalized update structure these algorithms share is given after the table.)
Open Source Code | No | The paper states: "We implement the algorithms based on Vanilla-PG (REINFORCE) implementation in the garage library (garage contributors, 2019)" but does not provide an explicit statement or link for the open-sourcing of their own proposed methods (N-PG-IGT, HARPG, N-HARPG).
Open Datasets | Yes | We test the methods on the commonly used MuJoCo environments. In this section, we present the results on the Humanoid and Reacher environments and defer the results on other tasks to Appendix A. For all methods, at each iteration we sample 20 trajectories, and each trajectory has a maximum length H = 500.
Dataset Splits | No | The paper describes training on MuJoCo environments but does not specify distinct training, validation, and test splits in terms of percentages or sample counts; this is typical for reinforcement learning, where data is generated dynamically through interaction with the environment.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory, cloud instances) for running the experiments. It only refers to running them in "MuJoCo environments".
Software Dependencies | No | The paper mentions using the "garage library (garage contributors, 2019)", but it does not provide specific version numbers for this library or for any other key software components, which would be needed for a reproducible description of the software environment.
Experiment Setup | Yes | For all methods, at each iteration we sample 20 trajectories, and each trajectory has a maximum length H = 500. To ensure the same per-iteration cost for all methods, in (N)-HARPG we sample half of the trajectories to compute the stochastic gradient and the other half to estimate the Hessian-vector product. Table 3: Hyper-parameters and step-size choice. The initial step size is chosen from a set of 13 values based on the best performance at the last iteration. (A sketch of this setup appears after the table.)
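
The pseudocode row above lists three normalized update rules. As a rough illustration only, the sketch below shows the generic recursive-momentum step such methods build on, with an optional Hessian-vector-product correction in the HARPG-style branch. The function and parameter names (`normalized_momentum_update`, `eta`, `gamma`, `hvp_est`) are assumptions for this sketch and are not taken verbatim from the paper's Algorithms 1-3.

```python
import numpy as np

def normalized_momentum_update(theta, d_prev, grad_est, hvp_est=None,
                               eta=0.5, gamma=0.01):
    """One normalized recursive-momentum step (illustrative sketch only).

    grad_est : stochastic policy gradient estimate at theta.
    hvp_est  : optional stochastic Hessian-vector product along the previous
               parameter displacement (HARPG-style correction); None gives a
               plain momentum recursion as in MPG/IGT-style methods.
    eta      : momentum weight; gamma : step size.
    """
    if hvp_est is not None:
        # Variance-reduced recursion: correct the old direction with the
        # Hessian-vector product before mixing in the fresh gradient.
        d = eta * grad_est + (1.0 - eta) * (d_prev + hvp_est)
    else:
        # Plain exponential-moving-average momentum direction.
        d = eta * grad_est + (1.0 - eta) * d_prev
    # Normalized ascent step: move a fixed distance gamma along d,
    # regardless of the magnitude of the gradient estimate.
    theta_next = theta + gamma * d / (np.linalg.norm(d) + 1e-12)
    return theta_next, d
```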
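
The experiment-setup rows report a per-iteration budget of 20 trajectories with horizon H = 500, split in half for (N)-HARPG, and an initial step size picked from 13 candidates by last-iteration performance. The following is a minimal sketch of that bookkeeping, assuming hypothetical helper callables (`run_training`, the trajectory objects) in place of the paper's garage-based implementation; only the numbers come from the paper.

```python
import numpy as np

# Numbers from the reported setup: 20 trajectories per iteration, horizon
# H = 500, and 13 candidate initial step sizes.
N_TRAJ, HORIZON, N_STEP_SIZES = 20, 500, 13

def split_batch_for_nharpg(trajectories):
    """Give half of the batch to the gradient estimator and the other half to
    the Hessian-vector-product estimator, keeping per-iteration cost equal."""
    half = len(trajectories) // 2
    return trajectories[:half], trajectories[half:]

def select_initial_step_size(run_training, step_size_grid):
    """Return the candidate step size whose run achieves the best final return.

    run_training(step_size) is assumed to train from the shared initial
    policy pi_theta0 and return the average return at the last iteration.
    """
    final_returns = [run_training(step_size) for step_size in step_size_grid]
    return step_size_grid[int(np.argmax(final_returns))]
```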