Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods
Authors: Sara Klein, Simon Weissmann, Leif Döring
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To illustrate this phenomenon, we implemented a simple toy example in which the advantage of the dynamic PG becomes visible. Figure 1 shows 5 simulations of the dynamic PG with different target accuracies (blue curves) plotted against one run of the simultaneous PG with target accuracy 0.1 (dashed magenta curve). The time horizon is chosen as H = 5. More details on the example can be found in Appendix E. |
| Researcher Affiliation | Academia | Institute of Mathematics, University of Mannheim |
| Pseudocode | Yes | Algorithm 1: Simultaneous Policy Gradient for finite-time MDPs |
| Open Source Code | No | The paper does not contain any explicit statements about making the source code available or provide links to a code repository. |
| Open Datasets | No | The paper's only experiment is a 'numerical toy example', a custom-defined MDP involving dice throwing; it does not use or provide access to a named public dataset. |
| Dataset Splits | No | The paper consists of a theoretical analysis and a numerical toy MDP example, neither of which involves explicit training/validation/test dataset splits in the conventional machine-learning sense. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the numerical experiments or simulations. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used for its implementation. |
| Experiment Setup | Yes | In the simulation we always initialised the parameters uniformly and chose θ = 0. Furthermore, we chose the suggested learning rates from Theorem 3.2 in the simultaneous approach and from Theorem 3.5 in the dynamic approach. ... In this simulation we chose ϵ = 5, 1, 0.5, 0.25, 0.12 to define the length of the training steps according to Theorem 3.5. |
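
The 'Research Type', 'Pseudocode', and 'Experiment Setup' rows above together describe the paper's only experiment: Algorithm 1 (simultaneous softmax policy gradient for finite-time MDPs) compared against the dynamic, epoch-by-epoch variant on a toy MDP with horizon H = 5. The following is a minimal re-implementation sketch of that comparison, not the authors' code: it assumes a small 2-state/2-action tabular MDP, uses exact gradients obtained by backward induction instead of the stochastic estimators analysed in the paper, weights the per-epoch objectives uniformly over states, and replaces the theorem-prescribed learning rates and iteration counts (Theorems 3.2 and 3.5, driven by the target accuracy ϵ) with fixed placeholder values.

```python
# Minimal sketch (not the authors' code): "simultaneous" tabular softmax policy
# gradient, which updates the parameters of all epochs at once, versus a
# "dynamic" variant that trains the epochs backwards in time, one after another.
# The 2-state/2-action MDP, the uniform state weighting of the per-epoch
# objectives, and all step sizes / iteration counts are illustrative
# placeholders, not the dice example or the theorem constants from the paper.
import numpy as np

rng = np.random.default_rng(0)

H, S, A = 5, 2, 2                                   # horizon, #states, #actions
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s'] transition kernel
R = rng.uniform(0.0, 1.0, size=(S, A))              # deterministic reward r(s, a)
mu0 = np.full(S, 1.0 / S)                           # uniform start distribution


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def q_values(theta, h_from=0):
    """Backward induction: Q[h][s, a] for h = h_from, ..., H-1 under softmax(theta)."""
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in range(H - 1, h_from - 1, -1):
        Q[h] = R + P @ V_next
        V_next = (softmax(theta[h]) * Q[h]).sum(axis=1)
    return Q


def expected_return(theta):
    Q = q_values(theta)
    return mu0 @ (softmax(theta[0]) * Q[0]).sum(axis=1)


def full_gradient(theta):
    """Exact finite-horizon policy gradient for the tabular softmax parametrisation:
    dJ/dtheta[h, s, a] = d_h(s) * pi_h(a|s) * A_h(s, a)."""
    Q = q_values(theta)
    grad = np.zeros_like(theta)
    d = mu0.copy()                                   # state distribution at epoch h
    for h in range(H):
        pi = softmax(theta[h])
        adv = Q[h] - (pi * Q[h]).sum(axis=1, keepdims=True)
        grad[h] = d[:, None] * pi * adv
        d = np.einsum("s,sa,sat->t", d, pi, P)       # propagate the state distribution
    return grad


def simultaneous_pg(steps=3000, eta=0.2):
    theta = np.zeros((H, S, A))                      # theta = 0 -> uniform policies
    for _ in range(steps):
        theta += eta * full_gradient(theta)          # ascent on all epochs at once
    return theta


def dynamic_pg(steps_per_epoch=500, eta=1.0):
    theta = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):                   # train the last epoch first
        for _ in range(steps_per_epoch):
            Q = q_values(theta, h_from=h)
            pi = softmax(theta[h])
            adv = Q[h] - (pi * Q[h]).sum(axis=1, keepdims=True)
            # per-epoch objective: value at epoch h, averaged uniformly over states
            theta[h] += eta * (1.0 / S) * pi * adv
    return theta


if __name__ == "__main__":
    print("simultaneous PG return:", expected_return(simultaneous_pg()))
    print("dynamic PG return     :", expected_return(dynamic_pg()))
```

The sketch is only meant to make the structural difference visible: the simultaneous scheme ascends on all H parameter blocks at once, while the dynamic scheme freezes the already-trained later epochs and solves one epoch at a time, backwards in time, in the spirit of dynamic programming. Reproducing Figure 1 faithfully would additionally require the dice MDP from Appendix E, stochastic gradient estimates, and the ϵ-dependent learning rates and step counts from Theorems 3.2 and 3.5, none of which are spelled out in this report.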