Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

$\phi$-Update: A Class of Policy Update Methods with Policy Convergence Guarantee

Authors: Wenye Li, Jiacai Liu, Ke Wei

ICLR 2025

Reproducibility Variables
Research Type: Experimental
LLM Response: "4 EMPIRICAL VALIDATIONS AND DISCUSSION: We empirically validate Theorems 3.3 and 3.4 using ϕ(t) = exp(t) (corresponding to softmax NPG), ϕ(t) = (1 + exp(−t))^{−1}, and ϕ(t) = tan(t) + 1. The step size is set to η = 1, η = 1, and η = 0.1, respectively. For Theorem 3.3, we use a simple 5×5 Grid World problem to test the policy convergence of ϕ-update. The computational results are presented in Figure 2. For Theorem 3.4, we use a random MDP to validate the exact asymptotic rate. The computational results are presented in Figure 3. In addition, more examples of ϕ-update have been tested, including extensions of softmax NPG and a family of polynomial updates; see Appendix F. We have conducted some preliminary numerical experiments on ϕ-update under the neural network parameterization for the exponential family presented in Section F.1."
Researcher Affiliation: Academia
LLM Response: "Wenye Li, Jiacai Liu, Ke Wei. School of Data Science, Fudan University."
Pseudocode: No
LLM Response: The paper defines the ϕ-update rule mathematically: ($\phi$-update) $\forall s \in \mathcal{S}, a \in \mathcal{A}: \pi^{+}(a|s) \propto \pi(a|s)\,\phi\big(\eta_s^{\pi} A^{\pi}(s, a)\big)$. While it describes the update process, it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
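For intuition, a minimal tabular sketch of one such update step, assuming the reconstructed rule π⁺(a|s) ∝ π(a|s)·ϕ(η·A^π(s, a)) with a single scalar step size (the paper's state-dependent step size η_s^π is simplified here, and the state/action sizes are illustrative):

```python
import numpy as np

def phi_update(pi, adv, phi, eta=1.0):
    """One tabular phi-update step: pi_plus(a|s) proportional to pi(a|s) * phi(eta * A(s, a)).

    pi  : (S, A) current policy; each row is a probability distribution
    adv : (S, A) advantage estimates A^pi(s, a)
    phi : positive, increasing scalar function, applied elementwise
    """
    weights = pi * phi(eta * adv)                 # unnormalized new policy
    return weights / weights.sum(axis=1, keepdims=True)

# phi(t) = exp(t) recovers softmax natural policy gradient (NPG).
pi = np.full((2, 3), 1.0 / 3.0)                   # uniform policy: 2 states, 3 actions
adv = np.array([[0.5, -0.2, -0.3],
                [0.1,  0.0, -0.1]])
pi_plus = phi_update(pi, adv, np.exp, eta=1.0)
print(pi_plus.sum(axis=1))                        # rows remain valid distributions
```

Because ϕ is positive and increasing, the update reweights each action by its advantage and renormalizes, so probability mass shifts toward higher-advantage actions.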
Open Source Code: No
LLM Response: The paper states: 'Preliminary numerical results about the implementation of ϕ-update under the neural network parameterization have been reported in Appendix G.' and 'comprehensive empirical evaluations of ϕ-update is beyond the scope of this paper'. It does not explicitly state that the source code for the described methodology is being released, nor does it provide a link to a repository.
Open Datasets: No
LLM Response: For Theorem 3.3, we use a simple 5×5 Grid World problem to test the policy convergence of ϕ-update. For Theorem 3.4, we use a random MDP to validate the exact asymptotic rate. The reward r(s, a) and transition probability P(s′|s, a) are uniformly generated from [0, 1] (P is further normalized to be a probability matrix). Three environments from MuJoCo are tested. The paper describes custom-generated environments (Grid World, random MDP) and standard simulation environments (MuJoCo) but does not provide concrete access information (link, DOI, citation) for publicly available datasets used in the experiments.
Dataset Splits: No
LLM Response: The paper describes how the simulation environments are set up, for example: 'The discount factor γ is set to 0.9, and we use a uniform policy as the initial policy' for the Grid World, and 'The reward r(s, a) and transition probability P(s′|s, a) are uniformly generated from [0, 1] (P is further normalized to be a probability matrix). The initial state distribution µ is uniform on S.' for the random MDP. These are environment-generation details, not training/validation/test dataset splits.
Hardware Specification: No
LLM Response: The paper states: 'We have conducted some preliminary numerical experiments on ϕ-update under the neural network parameterization for the exponential family presented in Section F.1. Three environments from MuJoCo are tested'. No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are mentioned.
Software Dependencies: No
LLM Response: The paper mentions: 'Three environments from MuJoCo are tested' and 'The policy neural network is a two-layer MLP with 64 units per layer'. However, it does not provide version numbers for any software components, libraries, or environments (e.g., MuJoCo version, Python version, deep learning framework version).
Experiment Setup: Yes
LLM Response: The step size is set to η = 1, η = 1, and η = 0.1, respectively. The discount factor γ is set to 0.9, and we use a uniform policy as the initial policy. The reward r(s, a) and transition probability P(s′|s, a) are uniformly generated from [0, 1] (P is further normalized to be a probability matrix). The initial state distribution µ is uniform on S. The policy neural network is a two-layer MLP with 64 units per layer, and the timesteps are 5 million for each experiment. For each experiment, we compute the mean and the standard deviation of the final cumulative reward across 10 different random seeds.
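As a concrete illustration of the quoted random-MDP setup, a minimal sketch of generating such an instance (rewards and transition weights drawn uniformly from [0, 1], transitions normalized to a row-stochastic matrix, uniform initial distribution and policy, γ = 0.9); the state/action counts and the seed are illustrative assumptions, not values from the paper:

```python
import numpy as np

def make_random_mdp(n_states=10, n_actions=4, gamma=0.9, seed=0):
    """Random MDP as described in the quoted setup:
    r(s, a) ~ U[0, 1]; P(s'|s, a) ~ U[0, 1], then normalized over s';
    uniform initial state distribution mu and uniform initial policy."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(size=(n_states, n_actions))            # rewards in [0, 1]
    P = rng.uniform(size=(n_states, n_actions, n_states))  # raw transition weights
    P /= P.sum(axis=2, keepdims=True)                      # normalize to probabilities
    mu = np.full(n_states, 1.0 / n_states)                 # uniform initial distribution
    pi0 = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
    return r, P, mu, pi0, gamma
```

Normalizing the uniform draws over the next-state axis is what the quoted setup means by 'P is further normalized to be a probability matrix'.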