Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods

Authors: Sara Klein, Simon Weissmann, Leif Döring

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To illustrate this phenomenon we implemented a simple toy example where the advantage of dynamic PG becomes visible. In Figure 1 one can see 5 simulations of the dynamic PG with different target accuracies (blue curves) plotted against one version of the simultaneous PG with target accuracy 0.1 (dashed magenta curve). The time-horizon is chosen as H = 5. More details on the example can be found in Appendix E."
Researcher Affiliation | Academia | Institute of Mathematics, University of Mannheim
Pseudocode | Yes | Algorithm 1: Simultaneous Policy Gradient for finite-time MDPs. (A hedged code sketch of this scheme follows the table.)
Open Source Code | No | The paper does not contain any explicit statement about making the source code available, nor does it link to a code repository.
Open Datasets | No | The paper describes a "numerical toy example", a custom-defined MDP involving dice throwing, rather than using or providing access to a named public dataset.
Dataset Splits | No | The paper analyses a theoretical framework and a numerical toy example of an MDP, which does not involve explicit training/validation/test splits in the conventional machine-learning sense.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the numerical experiments or simulations.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used for its implementation.
Experiment Setup | Yes | "In the simulation we always initialised the parameters uniformly and chose θ = 0. Furthermore we chose the suggested learning rates from Theorem 3.2 in the simultaneous approach and from Theorem 3.5 in the dynamic approach. ... In this simulation we chose ϵ = 5, 1, 0.5, 0.25, 0.12 to define the length of the training steps according to Theorem 3.5." (See the dynamic PG sketch below.)
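
Since the paper ships no code, the following is only a minimal NumPy sketch of what Algorithm 1 (Simultaneous Policy Gradient for finite-time MDPs) looks like in practice: all H softmax parameter blocks of a non-stationary policy are updated at once by exact gradient ascent. The toy MDP, the step size `eta`, and the iteration count are illustrative assumptions; this is not the dice example from Appendix E, and `eta` is a placeholder rather than the step size derived in Theorem 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy finite-time MDP -- an illustrative stand-in, NOT the
# dice-throwing example from the paper's Appendix E.
S, A, H = 4, 3, 5                           # states, actions, horizon (H = 5 as in Figure 1)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, :] is the next-state distribution
r = rng.uniform(0.0, 1.0, size=(S, A))      # stage reward r(s, a), time-independent here
mu = np.full(S, 1.0 / S)                    # initial state distribution

def softmax(theta):
    z = np.exp(theta - theta.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def value_and_gradient(theta):
    """Exact objective J(theta) and its gradient for a non-stationary softmax policy.

    Uses the finite-horizon softmax policy-gradient identity
        dJ/dtheta_h(s, a) = d_h(s) * pi_h(a|s) * (Q_h(s, a) - V_h(s)),
    where d_h is the state distribution at time h under pi started from mu.
    """
    pi = softmax(theta)                     # shape (H, S, A)
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):            # backward pass for Q_h and V_h
        Q[h] = r + P @ V[h + 1]
        V[h] = (pi[h] * Q[h]).sum(axis=1)
    d = np.zeros((H, S))                    # forward pass for the visitation d_h
    d[0] = mu
    for h in range(H - 1):
        d[h + 1] = np.einsum('s,sa,sat->t', d[h], pi[h], P)
    grad = d[:, :, None] * pi * (Q - V[:H, :, None])
    return mu @ V[0], grad

# Simultaneous PG (cf. Algorithm 1): ascend on all H parameter blocks at once.
theta = np.zeros((H, S, A))                 # theta = 0, i.e. uniform initial policies
eta = 0.1                                   # placeholder step size, not the Theorem 3.2 rate
for _ in range(500):
    J, grad = value_and_gradient(theta)
    theta += eta * grad
print(f"simultaneous PG: J = {J:.4f}")
```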
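
For contrast, the dynamic approach trains the parameter blocks backwards in time, one stage at a time, with the training length per stage driven by the target accuracy ϵ via Theorem 3.5. The sketch below substitutes a hypothetical gradient-norm stopping rule for the paper's prescribed step counts, and treats each stage as a coordinate-ascent step on the full objective, which simplifies the paper's per-stage subproblems; it reuses `value_and_gradient` and the toy MDP from the previous block.

```python
# Dynamic PG sketch: train the H parameter blocks backwards in time, one stage
# at a time, while later (already trained) stages stay fixed.  The paper sets
# the number of gradient steps per stage from Theorem 3.5 and the accuracy eps;
# the gradient-norm stopping rule below is a hypothetical stand-in for that.
def dynamic_pg(eps, eta=0.1, max_steps=10_000):
    theta = np.zeros((H, S, A))
    for h in reversed(range(H)):
        for _ in range(max_steps):
            _, grad = value_and_gradient(theta)
            if np.abs(grad[h]).max() < eps:  # placeholder stopping criterion
                break
            theta[h] += eta * grad[h]        # update stage h only
    return theta

for eps in [5, 1, 0.5, 0.25, 0.12]:          # the target accuracies quoted above
    J, _ = value_and_gradient(dynamic_pg(eps))
    print(f"dynamic PG, eps = {eps}: J = {J:.4f}")
```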