Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods

Authors: Sara Klein, Simon Weissmann, Leif Döring

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To illustrate this phenomenon we implemented a simple toy example where the advantage of dynamic PG becomes visible. In Figure 1 one can see 5 simulations of the dynamic PG with different target accuracies (blue curves) plotted against one version of the simultaneous PG with target accuracy 0.1 (dashed magenta curve). The time-horizon is chosen as H = 5. More details on the example can be found in Appendix E."
Researcher Affiliation | Academia | Institute of Mathematics, University of Mannheim
Pseudocode | Yes | Algorithm 1: Simultaneous Policy Gradient for finite-time MDPs. (A hedged code sketch of this scheme follows the table.)
Open Source Code | No | The paper does not contain any explicit statement about making the source code available, nor does it link to a code repository.
Open Datasets | No | The paper describes a "numerical toy example", a custom-defined MDP involving dice throwing, rather than using or providing access to a named public dataset.
Dataset Splits | No | The paper analyses a theoretical framework and a numerical toy example of an MDP, which does not involve explicit training/validation/test splits in the conventional machine-learning sense.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the numerical experiments or simulations.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used for its implementation.
Experiment Setup | Yes | "In the simulation we always initialised the parameters uniformly and chose θ = 0. Furthermore we chose the suggested learning rates from Theorem 3.2 in the simultaneous approach and from Theorem 3.5 in the dynamic approach. ... In this simulation we chose ϵ = 5, 1, 0.5, 0.25, 0.12 to define the length of the training steps according to Theorem 3.5." (See the dynamic PG sketch below.)
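
Since the paper ships no code, the following is only a minimal NumPy sketch of what Algorithm 1 (Simultaneous Policy Gradient for finite-time MDPs) looks like in practice: all H softmax parameter blocks of a non-stationary policy are updated at once by exact gradient ascent. The toy MDP, the step size `eta`, and the iteration count are illustrative assumptions; this is not the dice example from Appendix E, and `eta` is a placeholder rather than the step size derived in Theorem 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy finite-time MDP -- an illustrative stand-in, NOT the
# dice-throwing example from the paper's Appendix E.
S, A, H = 4, 3, 5                           # states, actions, horizon (H = 5 as in Figure 1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, :] is the next-state distribution
r = rng.uniform(0.0, 1.0, size=(S, A))      # stage reward r(s, a), time-independent here
mu = np.full(S, 1.0 / S)                    # initial state distribution

def softmax(theta):
    z = np.exp(theta - theta.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def value_and_gradient(theta):
    """Exact objective J(theta) and its gradient for a non-stationary softmax policy.

    Uses the finite-horizon softmax policy-gradient identity
        dJ/dtheta_h(s, a) = d_h(s) * pi_h(a|s) * (Q_h(s, a) - V_h(s)),
    where d_h is the state distribution at time h under pi started from mu.
    """
    pi = softmax(theta)                     # shape (H, S, A)
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):            # backward pass for Q_h and V_h
        Q[h] = r + P @ V[h + 1]
        V[h] = (pi[h] * Q[h]).sum(axis=1)
    d = np.zeros((H, S))                    # forward pass for the visitation d_h
    d[0] = mu
    for h in range(H - 1):
        d[h + 1] = np.einsum('s,sa,sat->t', d[h], pi[h], P)
    grad = d[:, :, None] * pi * (Q - V[:H, :, None])
    return mu @ V[0], grad

# Simultaneous PG (cf. Algorithm 1): ascend on all H parameter blocks at once.
theta = np.zeros((H, S, A))                 # theta = 0, i.e. uniform initial policies
eta = 0.1                                   # placeholder step size, not the Theorem 3.2 rate
for _ in range(500):
    J, grad = value_and_gradient(theta)
    theta += eta * grad
print(f"simultaneous PG: J = {J:.4f}")
```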
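
For contrast, the dynamic approach trains the parameter blocks backwards in time, one stage at a time, with the training length per stage driven by the target accuracy ϵ via Theorem 3.5. The sketch below substitutes a hypothetical gradient-norm stopping rule for the paper's prescribed step counts, and treats each stage as a coordinate-ascent step on the full objective, which simplifies the paper's per-stage subproblems; it reuses `value_and_gradient` and the toy MDP from the previous block.

```python
# Dynamic PG sketch: train the H parameter blocks backwards in time, one stage
# at a time, while later (already trained) stages stay fixed.  The paper sets
# the number of gradient steps per stage from Theorem 3.5 and the accuracy eps;
# the gradient-norm stopping rule below is a hypothetical stand-in for that.
def dynamic_pg(eps, eta=0.1, max_steps=10_000):
    theta = np.zeros((H, S, A))
    for h in reversed(range(H)):
        for _ in range(max_steps):
            _, grad = value_and_gradient(theta)
            if np.abs(grad[h]).max() < eps:  # placeholder stopping criterion
                break
            theta[h] += eta * grad[h]        # update stage h only
    return theta

for eps in [5, 1, 0.5, 0.25, 0.12]:          # the target accuracies quoted above
    J, _ = value_and_gradient(dynamic_pg(eps))
    print(f"dynamic PG, eps = {eps}: J = {J:.4f}")
```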