Reward-Weighted Regression Converges to a Global Optimum

Authors: Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber

AAAI 2022, pp. 8361-8369 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum. [...] Section 6 illustrates experimentally that for a simple MDP the presented update scheme converges to the optimal policy; [...] The bottom left of Figure 1 shows the root-mean-squared value error (RMSVE) of the learned policy at each iteration as compared to the optimal policy, while the bottom right shows the return obtained by the learned policy at each iteration. Smooth convergence can be observed under reward-weighted regression. (A hedged sketch of this tabular update scheme appears after the table.)
Researcher Affiliation | Collaboration | 1 The Swiss AI Lab IDSIA, Università della Svizzera italiana (USI) & SUPSI, Lugano, Switzerland; 2 NNAISENSE, Lugano, Switzerland; 3 King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; {struplm, francesco, dylan.ashley}@idsia.ch, rupesh@nnaisense.com, juergen@idsia.ch
Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm.
Open Source Code | Yes | The source code for this experiment is available at https://github.com/dylanashley/rewardweighted-regression.
Open Datasets | Yes | In particular, we ensure that rewards are positive and that there is no function approximation for value functions and policies. In order to meet these criteria, we use the modified four-room gridworld domain (Sutton, Precup, and Singh 1999) shown on the left of Figure 1.
Dataset Splits | No | The paper describes the environment and how policy performance is measured (RMSVE, return), but does not provide specific training, validation, or test dataset splits.
Hardware Specification | No | The paper mentions support from various organizations and donations of hardware (e.g., "DGX-1", "Minsky machine"), but it does not explicitly state which specific hardware was used to run the experiments described in the paper.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | In particular, we ensure that rewards are positive and that there is no function approximation for value functions and policies. In order to meet these criteria, we use the modified four-room gridworld domain (Sutton, Precup, and Singh 1999) shown on the left of Figure 1. [...] The agent receives a reward of 1 when transitioning from a non-goal state to the goal state and a reward of 0.001 otherwise. The discount rate is 0.9 at each step. [...] All lines are averages of 100 runs under different uniform random initial policies. (A toy driver mirroring this protocol is sketched below.)
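
In the tabular case, the update scheme the paper analyzes is an RWR iteration of the form pi_{k+1}(a|s) proportional to pi_k(a|s) * q_{pi_k}(s,a) / v_{pi_k}(s), with exact policy evaluation and no function approximation. Below is a minimal sketch assuming that form of the update; the function names (evaluate_policy, rwr_step) are illustrative and not taken from the released code.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation for a finite MDP.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    pi: (S, A) policy, gamma: discount rate in [0, 1).
    Returns the action-value table q (S, A) and state-value vector v (S,).
    """
    S, A = R.shape
    P_pi = np.einsum('sa,sat->st', pi, P)   # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    q = R + gamma * np.einsum('sat,t->sa', P, v)
    return q, v

def rwr_step(pi, q, v):
    """One tabular reward-weighted regression update.

    Rewards are assumed strictly positive, so q > 0 and v > 0 and the
    multiplicative update is well defined.
    """
    new_pi = pi * q / v[:, None]
    return new_pi / new_pi.sum(axis=1, keepdims=True)  # renormalize against rounding
```

Because all rewards are strictly positive, q and v stay positive and the ratio never degenerates, which is presumably why the experiment uses a small positive step reward (0.001) rather than zero.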
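A toy driver matching the reported protocol (reward 1 on entering the goal, 0.001 otherwise, discount rate 0.9, averaging over 100 uniform random initial policies) might look as follows. It reuses evaluate_policy and rwr_step from the sketch above; the environment here is a plain open grid standing in for the four-room layout, and the grid size and iteration budget are illustrative, not taken from the paper.

```python
import numpy as np

def build_grid(n=5, goal=24):
    """Deterministic n x n grid, 4 actions; reward 1 on entering the (absorbing) goal."""
    S, A = n * n, 4
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    P = np.zeros((S, A, S))
    R = np.full((S, A), 0.001)
    for s in range(S):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            if s == goal:                         # goal is absorbing
                P[s, a, s] = 1.0
                continue
            nr = min(max(r + dr, 0), n - 1)
            nc = min(max(c + dc, 0), n - 1)
            t = nr * n + nc
            P[s, a, t] = 1.0
            if t == goal:
                R[s, a] = 1.0
    return P, R

gamma = 0.9
P, R = build_grid()
S, A = R.shape

# Optimal state values via value iteration, used as the RMSVE baseline.
v_star = np.zeros(S)
for _ in range(1000):
    v_star = np.max(R + gamma * P @ v_star, axis=1)

rng = np.random.default_rng(0)
errors = []
for _ in range(100):                              # 100 uniform random initial policies
    pi = rng.random((S, A))
    pi /= pi.sum(axis=1, keepdims=True)
    run = []
    for _ in range(50):                           # iteration budget is illustrative
        q, v = evaluate_policy(P, R, pi, gamma)
        run.append(np.sqrt(np.mean((v - v_star) ** 2)))   # RMSVE at this iteration
        pi = rwr_step(pi, q, v)
    errors.append(run)

print(np.mean(errors, axis=0)[[0, 10, 49]])       # average RMSVE early vs. late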