Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Structure Matters: Dynamic Policy Gradient

Authors: Sara Klein, Xiangyuan Zhang, Tamer Basar, Simon Weissmann, Leif Döring

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To demonstrate DynPG's effectiveness, we present a numerical study of a canonical example where vanilla PG suffers from committal behavior. A detailed description of the MDP and the experimental setup can be found in Appendix B. This theory paper contributes to the understanding of PG methods by directly including dynamic programming to gradient based policy search. To this end, we have introduced DynPG, an algorithm that directly combines DP and gradient ascent to solve γ-discounted infinite horizon MDPs. We provide mathematically rigorous performance estimates (for exact gradients in the main text, estimated gradients in Appendix F) in simplified situations which are typical for theory papers on PG methods.
Researcher Affiliation Collaboration (1) Institute for Mathematics, University of Mannheim, 68138 Mannheim; (2) Department of Electrical and Computer Engineering, and Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, IL 61801; (3) August-Wilhelm Scheer Institute for Digital Products and Processes gGmbH, 66123 Saarbrücken
Pseudocode Yes Algorithm 1: DynPG; Algorithm 2: DynAC
Open Source Code Yes The code is available at https://github.com/Sara-Klein/StructureMatters-DynPG.
Open Datasets No The example is an extension of [24, Example 6.7], which has been used to compare different variants of Q-learning algorithms, as it suffers from overestimation of the Q-values. The MDP is defined as follows: The state space is given by S := {0, ..., 6}; states 0, 3, 6 are the terminal states and states 1, 2, 4, 5 are the initial states. We sample s_0 uniformly from {1, 2, 4, 5}. The action space is given by A := {0, ..., 299}.
Dataset Splits No The state space is given by S := {0, ..., 6}; states 0, 3, 6 are the terminal states and states 1, 2, 4, 5 are the initial states. We sample s_0 uniformly from {1, 2, 4, 5}. This describes sampling initial states for the MDP, not traditional dataset splits for train/test/validation.
Hardware Specification No The NeurIPS Paper Checklist for "Experiments compute resources" states "Answer: [NA] Justification: Toy example which can be run without specific computer resources." This indicates that no specific hardware details are provided for the experiments.
Software Dependencies No The paper does not explicitly list any software dependencies with specific version numbers. While code is provided, the text does not contain details such as
Experiment Setup Yes Experimental setup: We evaluated the performance of (stochastic) vanilla PG and (stochastic) DynPG under two different discount factors, γ = 0.9 and γ = 0.99. We used the tabular softmax parametrization studied in the convergence analysis for both algorithms. In DynPG, we used the 1-batch Monte-Carlo estimator to sample the gradient according to Theorem A.3. In vanilla PG, we chose the classical REINFORCE 1-batch estimator with truncation horizon 3, such that the estimator is also unbiased due to the episodic setting (the maximum episode length in our example is 3). In DynPG, we chose the step size η_h and number of training steps N_h according to Theorem D.6 and Theorem D.8, and only fine-tuned the constants 2 and 45: η_h = (1 − γ^h)/2, N_h = ⌈45/(1 − γ^(h+1))⌉. For a fair comparison, we fine-tuned η = 2(1 − γ)/(1 − γ^6) for stochastic vanilla PG, which is much larger than the pessimistic η = c(1 − γ)^3 suggested in [16].
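To make the quoted setup concrete, the following Python sketch collects the state and action spaces, the uniform initial-state draw, the fine-tuned vanilla PG step size, and the DynPG training-step count. It is an illustration only and is not taken from the authors' repository. All function and constant names are hypothetical, the formula groupings are a reading of the flattened text above, and the default c = 1.0 is a placeholder because the quoted text leaves c unspecified. Transitions and rewards are omitted since the table does not state them.

```python
import math
import random
from fractions import Fraction

# Spaces as quoted in the Open Datasets row (names are hypothetical).
STATES = list(range(7))        # S = {0, ..., 6}
TERMINAL_STATES = {0, 3, 6}
INITIAL_STATES = [1, 2, 4, 5]
NUM_ACTIONS = 300              # A = {0, ..., 299}


def sample_initial_state(rng: random.Random) -> int:
    """Draw s_0 uniformly from the initial states."""
    return rng.choice(INITIAL_STATES)


def vanilla_pg_step_size(gamma: float) -> float:
    """Fine-tuned vanilla PG step size, read as 2 (1 - gamma) / (1 - gamma^6)."""
    return 2.0 * (1.0 - gamma) / (1.0 - gamma ** 6)


def pessimistic_step_size(gamma: float, c: float = 1.0) -> float:
    """Scaling c (1 - gamma)^3; c = 1.0 is a placeholder, not a paper value."""
    return c * (1.0 - gamma) ** 3


def dynpg_training_steps(gamma: float, h: int) -> int:
    """N_h, read as ceil(45 / (1 - gamma^(h+1))).

    Exact rational arithmetic avoids a floating-point ceiling that could
    round up by one when the quotient is a whole number.
    """
    g = Fraction(str(gamma))
    return math.ceil(45 / (1 - g ** (h + 1)))


if __name__ == "__main__":
    rng = random.Random(0)
    print("s_0 =", sample_initial_state(rng))
    for gamma in (0.9, 0.99):
        print(
            f"gamma={gamma}: vanilla eta={vanilla_pg_step_size(gamma):.4f}, "
            f"(1-gamma)^3 scaling={pessimistic_step_size(gamma):.2e}, "
            f"N_0={dynpg_training_steps(gamma, 0)}"
        )
```

Under this reading, the vanilla step size exceeds the (1 − γ)^3 scaling by several orders of magnitude at both discount factors, which matches the row's remark that the fine-tuned value is much larger than the pessimistic one.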