Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Global Convergence of Policy Gradient in Average Reward MDPs

Authors: Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Y Levy, R. Srikant, Shie Mannor

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We also present simulations that empirically validate the result."
Researcher Affiliation | Collaboration | Navdeep Kumar (Electrical and Computer Engineering, Technion - Israel Institute of Technology); Yashaswini Murthy (ECE & CSL, University of Illinois Urbana-Champaign); Shie Mannor (Electrical Engineering, Technion - Israel Institute of Technology; NVIDIA Research)
Pseudocode | No | The paper presents mathematical derivations and theoretical results; it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that the code is open-sourced nor provides a link to a code repository or a mention of code in supplementary materials.
Open Datasets | No | The paper describes how the MDPs are constructed and how transition kernels are randomly generated for the simulations (e.g., "We construct the transition kernel and the reward function in the same manner for all MDPs..." and "We randomly generate a transition kernel..."), rather than using pre-existing, publicly available datasets with access information.
Dataset Splits | No | The simulation section details the construction of Markov Decision Processes for empirical validation, varying parameters such as state- and action-space cardinalities and reward functions; it does not involve standard datasets with explicit training, validation, or test splits.
Hardware Specification | No | The paper provides no details about the hardware (e.g., GPU/CPU models, memory) used to run the simulations or experiments.
Software Dependencies | No | The paper does not list any software dependencies or version numbers, such as programming languages, libraries, or frameworks, used for the implementation or experiments.
Experiment Setup | No | The paper describes aspects of the simulation setup, such as the number of iterations ("Projected policy gradient was implemented for 2000 iterations") and the construction of the MDPs. However, it does not give concrete hyperparameters for the policy gradient algorithm, such as the learning rate used in the simulations, or other training configurations.
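The table notes that the paper runs projected policy gradient for 2000 iterations on randomly generated transition kernels, but reports no learning rate or construction code. The following is a minimal sketch of that kind of experiment, not the authors' implementation: the Dirichlet kernel generation, the state/action sizes, and the step size `eta` are illustrative assumptions, and the exact average-reward gradient is computed in closed form via the stationary distribution and the differential value function.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex (sort-based algorithm).
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[k] / (k + 1.0)
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(0)
S, A = 5, 3                                    # illustrative sizes (assumption)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))                   # rewards in [0, 1]

def avg_reward_and_grad(pi):
    # Average reward rho(pi) and its exact gradient for a direct (tabular) policy.
    P_pi = np.einsum('sa,sat->st', pi, P)      # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, r)        # expected reward per state under pi
    w, vl = np.linalg.eig(P_pi.T)              # stationary distribution: left eigvec for eigval 1
    d = np.real(vl[:, np.argmin(np.abs(w - 1.0))])
    d = d / d.sum()
    rho = d @ r_pi
    # Differential value V solves (I - P_pi) V = r_pi - rho*1 with d @ V = 0.
    V = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), d), r_pi - rho)
    Q = r - rho + P @ V                        # differential Q-function
    return rho, d[:, None] * Q                 # grad wrt pi(a|s) is d(s) * Q(s, a)

pi = np.full((S, A), 1.0 / A)                  # start from the uniform policy
eta = 0.1                                      # step size: arbitrary illustrative choice
rho_history = []
for _ in range(2000):                          # iteration count taken from the paper's quote
    rho, grad = avg_reward_and_grad(pi)
    rho_history.append(rho)
    for s in range(S):                         # projected gradient ascent step, row by row
        pi[s] = project_simplex(pi[s] + eta * grad[s])
```

Because every Dirichlet-sampled row of the kernel has full support, the induced chain stays ergodic for any policy, so the stationary distribution and the linear solve above are well defined throughout the run.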