Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

$\phi$-Update: A Class of Policy Update Methods with Policy Convergence Guarantee

Authors: Wenye Li, Jiacai Liu, Ke Wei

ICLR 2025

Reproducibility Variables
Research Type: Experimental
LLM Response: "4 EMPIRICAL VALIDATIONS AND DISCUSSION: We empirically validate Theorems 3.3 and 3.4 using ϕ(t) = exp(t) (corresponding to softmax NPG), ϕ(t) = (1 + exp(−t))^{−1}, and ϕ(t) = tan(t) + 1. The step size is set to η = 1, η = 1, and η = 0.1, respectively. For Theorem 3.3, we use a simple 5×5 Grid World problem to test the policy convergence of ϕ-update. The computational results are presented in Figure 2. For Theorem 3.4, we use a random MDP to validate the exact asymptotic rate. The computational results are presented in Figure 3. In addition, more examples of ϕ-update have been tested, including extensions of softmax NPG and a family of polynomial updates; see Appendix F. We have conducted some preliminary numerical experiments on ϕ-update under the neural network parameterization for the exponential family presented in Section F.1."
Researcher Affiliation: Academia
LLM Response: "Wenye Li, Jiacai Liu, Ke Wei. School of Data Science, Fudan University."
Pseudocode: No
LLM Response: The paper defines the ϕ-update rule mathematically: ($\phi$-update) $\forall s \in \mathcal{S}, a \in \mathcal{A}: \pi^{+}(a|s) \propto \pi(a|s)\,\phi\big(\eta_s^{\pi} A^{\pi}(s, a)\big)$. While it describes the update process, it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
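For intuition, a minimal tabular sketch of one such update step, assuming the reconstructed rule π⁺(a|s) ∝ π(a|s)·ϕ(η·A^π(s, a)) with a single scalar step size (the paper's state-dependent step size η_s^π is simplified here, and the state/action sizes are illustrative):

```python
import numpy as np

def phi_update(pi, adv, phi, eta=1.0):
    """One tabular phi-update step: pi_plus(a|s) proportional to pi(a|s) * phi(eta * A(s, a)).

    pi  : (S, A) current policy; each row is a probability distribution
    adv : (S, A) advantage estimates A^pi(s, a)
    phi : positive, increasing scalar function, applied elementwise
    """
    weights = pi * phi(eta * adv)                 # unnormalized new policy
    return weights / weights.sum(axis=1, keepdims=True)

# phi(t) = exp(t) recovers softmax natural policy gradient (NPG).
pi = np.full((2, 3), 1.0 / 3.0)                   # uniform policy: 2 states, 3 actions
adv = np.array([[0.5, -0.2, -0.3],
                [0.1,  0.0, -0.1]])
pi_plus = phi_update(pi, adv, np.exp, eta=1.0)
print(pi_plus.sum(axis=1))                        # rows remain valid distributions
```

Because ϕ is positive and increasing, the update reweights each action by its advantage and renormalizes, so probability mass shifts toward higher-advantage actions.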
Open Source Code: No
LLM Response: The paper states: 'Preliminary numerical results about the implementation of ϕ-update under the neural network parameterization have been reported in Appendix G.' and 'comprehensive empirical evaluations of ϕ-update is beyond the scope of this paper'. It does not explicitly state that the source code for the described methodology is being released, nor does it provide a link to a repository.
Open Datasets: No
LLM Response: For Theorem 3.3, we use a simple 5×5 Grid World problem to test the policy convergence of ϕ-update. For Theorem 3.4, we use a random MDP to validate the exact asymptotic rate. The reward r(s, a) and transition probability P(s′|s, a) are uniformly generated from [0, 1] (P is further normalized to be a probability matrix). Three environments from MuJoCo are tested. The paper describes custom-generated environments (Grid World, random MDP) and standard simulation environments (MuJoCo) but does not provide concrete access information (link, DOI, citation) for publicly available datasets used in the experiments.
Dataset Splits: No
LLM Response: The paper describes how the simulation environments are set up, for example: 'The discount factor γ is set to 0.9, and we use a uniform policy as the initial policy' for the Grid World, and 'The reward r(s, a) and transition probability P(s′|s, a) are uniformly generated from [0, 1] (P is further normalized to be a probability matrix). The initial state distribution µ is uniform on S.' for the random MDP. These are environment-generation details, not training/validation/test dataset splits.
Hardware Specification: No
LLM Response: The paper states: 'We have conducted some preliminary numerical experiments on ϕ-update under the neural network parameterization for the exponential family presented in Section F.1. Three environments from MuJoCo are tested'. No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are mentioned.
Software Dependencies: No
LLM Response: The paper mentions: 'Three environments from MuJoCo are tested' and 'The policy neural network is a two-layer MLP with 64 units per layer'. However, it does not provide version numbers for any software components, libraries, or environments (e.g., MuJoCo version, Python version, deep learning framework version).
Experiment Setup: Yes
LLM Response: The step size is set to η = 1, η = 1, and η = 0.1, respectively. The discount factor γ is set to 0.9, and we use a uniform policy as the initial policy. The reward r(s, a) and transition probability P(s′|s, a) are uniformly generated from [0, 1] (P is further normalized to be a probability matrix). The initial state distribution µ is uniform on S. The policy neural network is a two-layer MLP with 64 units per layer, and the timesteps are 5 million for each experiment. For each experiment, we compute the mean and the standard deviation of the final cumulative reward across 10 different random seeds.
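As a concrete illustration of the quoted random-MDP setup, a minimal sketch of generating such an instance (rewards and transition weights drawn uniformly from [0, 1], transitions normalized to a row-stochastic matrix, uniform initial distribution and policy, γ = 0.9); the state/action counts and the seed are illustrative assumptions, not values from the paper:

```python
import numpy as np

def make_random_mdp(n_states=10, n_actions=4, gamma=0.9, seed=0):
    """Random MDP as described in the quoted setup:
    r(s, a) ~ U[0, 1]; P(s'|s, a) ~ U[0, 1], then normalized over s';
    uniform initial state distribution mu and uniform initial policy."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(size=(n_states, n_actions))            # rewards in [0, 1]
    P = rng.uniform(size=(n_states, n_actions, n_states))  # raw transition weights
    P /= P.sum(axis=2, keepdims=True)                      # normalize to probabilities
    mu = np.full(n_states, 1.0 / n_states)                 # uniform initial distribution
    pi0 = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
    return r, P, mu, pi0, gamma
```

Normalizing the uniform draws over the next-state axis is what the quoted setup means by 'P is further normalized to be a probability matrix'.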