Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Convergence of Policy Mirror Descent Beyond Compatible Function Approximation
Authors: Uri Sherman, Tomer Koren, Yishay Mansour
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a strictly weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space. |
| Researcher Affiliation | Collaboration | 1Blavatnik School of Computer Science and AI, Tel Aviv University, Tel Aviv, Israel 2Google Research, Tel Aviv, Israel. |
| Pseudocode | Yes | Algorithm 1: Policy Mirror Descent (on-policy). Input: learning rate eta > 0, regularizer R: R^A -> R. Initialize pi_1 in Prod. For k = 1 to K do: set mu_k := mu_pi_k and hat{Q}_k := hat{Q}_pi_k; then pi_{k+1} = arg min_{pi in Prod} E_{s ~ mu_k}[ H * D * hat{Q}_k(s, pi_s) + (1/eta) * B_R(pi_s, pi_k(s)) ]. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories in the main text or supplementary sections. |
| Open Datasets | No | The paper is theoretical in nature and does not conduct experiments using specific datasets. Therefore, it does not mention the availability of any open datasets. |
| Dataset Splits | No | The paper focuses on theoretical analysis and does not involve experimental evaluation on datasets. As such, there is no mention of training/test/validation dataset splits. |
| Hardware Specification | No | The paper is a theoretical work focusing on algorithmic convergence and does not describe any experiments that would require specific hardware. No hardware specifications are mentioned. |
| Software Dependencies | No | The paper presents a theoretical framework and does not mention any specific software or libraries, along with their version numbers, that would be required to reproduce experimental results. |
| Experiment Setup | No | The paper presents a theoretical framework and does not detail any experimental setup, including hyperparameters or system-level training settings, as it does not conduct experiments. |
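
The Pseudocode row quotes Algorithm 1 only in flattened form, so a small illustration of the update may help. What follows is a minimal sketch under assumptions that are not taken from the paper: a tabular MDP with costs, exact Q-values in place of the estimate hat{Q}_k, the full policy class, and a negative-entropy regularizer, for which the Bregman divergence is the KL divergence and the arg min has a closed form. In that special case the expectation over s ~ mu_k decouples across states with positive occupancy, and any scaling in front of the Q term is folded into eta. The paper itself targets general policy classes where no such closed form is available. The names `q_values`, `pmd_step`, `policy_mirror_descent`, and the two-state MDP are hypothetical.

```python
import math


def q_values(P, c, pi, gamma, sweeps=500):
    """Approximate Q^pi for a cost MDP by repeated Bellman backups."""
    S, A = len(c), len(c[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(sweeps):
        # State values under the current policy.
        V = [sum(pi[s][a] * Q[s][a] for a in range(A)) for s in range(S)]
        Q = [[c[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
    return Q


def pmd_step(pi, Q, eta):
    """Closed-form mirror step for the negative-entropy regularizer.

    Minimizing <Q(s, .), p> + (1/eta) * KL(p || pi(.|s)) over the simplex
    gives p(a) proportional to pi(a|s) * exp(-eta * Q(s, a)).
    """
    new_pi = []
    for probs, q in zip(pi, Q):
        q_min = min(q)  # shift for stability; it cancels after normalizing
        w = [p * math.exp(-eta * (x - q_min)) for p, x in zip(probs, q)]
        z = sum(w)
        new_pi.append([x / z for x in w])
    return new_pi


def policy_mirror_descent(P, c, gamma, eta, K):
    """Run K mirror steps from the uniform policy (assumed initialization)."""
    S, A = len(c), len(c[0])
    pi = [[1.0 / A] * A for _ in range(S)]
    for _ in range(K):
        pi = pmd_step(pi, q_values(P, c, pi, gamma), eta)
    return pi


# Hypothetical 2-state, 2-action MDP: P[s][a][t] is the probability of
# moving from state s to state t under action a; c[s][a] is the cost.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
c = [[1.0, 0.0],
     [0.0, 1.0]]

pi = policy_mirror_descent(P, c, gamma=0.9, eta=1.0, K=20)
for s, row in enumerate(pi):
    print(s, [round(p, 4) for p in row])
```

In this toy MDP the zero-cost action in each state is action 1 in state 0 and action 0 in state 1, so the returned policy should concentrate on those actions as K grows.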