Reward-Weighted Regression Converges to a Global Optimum

Authors: Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber

AAAI 2022, pp. 8361-8369 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum. [...] Section 6 illustrates experimentally that for a simple MDP the presented update scheme converges to the optimal policy; [...] The bottom left of Figure 1 shows the root-mean-squared value error (RMSVE) of the learned policy at each iteration as compared to the optimal policy, while the bottom right shows the return obtained by the learned policy at each iteration. Smooth convergence can be observed under reward-weighted regression. (A hedged sketch of this tabular update scheme appears after the table.)
Researcher Affiliation | Collaboration | 1 The Swiss AI Lab IDSIA, Università della Svizzera italiana (USI) & SUPSI, Lugano, Switzerland; 2 NNAISENSE, Lugano, Switzerland; 3 King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; {struplm, francesco, dylan.ashley}@idsia.ch, rupesh@nnaisense.com, juergen@idsia.ch
Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm.
Open Source Code | Yes | The source code for this experiment is available at https://github.com/dylanashley/rewardweighted-regression.
Open Datasets | Yes | In particular, we ensure that rewards are positive and that there is no function approximation for value functions and policies. In order to meet these criteria, we use the modified four-room gridworld domain (Sutton, Precup, and Singh 1999) shown on the left of Figure 1.
Dataset Splits | No | The paper describes the environment and how policy performance is measured (RMSVE, return), but does not provide specific training, validation, or test dataset splits.
Hardware Specification | No | The paper mentions support from various organizations and donations of hardware (e.g., "DGX-1", "Minsky machine"), but it does not explicitly state which specific hardware was used to run the experiments described in the paper.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | In particular, we ensure that rewards are positive and that there is no function approximation for value functions and policies. In order to meet these criteria, we use the modified four-room gridworld domain (Sutton, Precup, and Singh 1999) shown on the left of Figure 1. [...] The agent receives a reward of 1 when transitioning from a non-goal state to the goal state and a reward of 0.001 otherwise. The discount rate is 0.9 at each step. [...] All lines are averages of 100 runs under different uniform random initial policies. (A toy driver mirroring this protocol is sketched below.)
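
In the tabular case, the update scheme the paper analyzes is an RWR iteration of the form pi_{k+1}(a|s) proportional to pi_k(a|s) * q_{pi_k}(s,a) / v_{pi_k}(s), with exact policy evaluation and no function approximation. Below is a minimal sketch assuming that form of the update; the function names (evaluate_policy, rwr_step) are illustrative and not taken from the released code.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation for a finite MDP.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    pi: (S, A) policy, gamma: discount rate in [0, 1).
    Returns the action-value table q (S, A) and state-value vector v (S,).
    """
    S, A = R.shape
    P_pi = np.einsum('sa,sat->st', pi, P)   # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    q = R + gamma * np.einsum('sat,t->sa', P, v)
    return q, v

def rwr_step(pi, q, v):
    """One tabular reward-weighted regression update.

    Rewards are assumed strictly positive, so q > 0 and v > 0 and the
    multiplicative update is well defined.
    """
    new_pi = pi * q / v[:, None]
    return new_pi / new_pi.sum(axis=1, keepdims=True)  # renormalize against rounding
```

Because all rewards are strictly positive, q and v stay positive and the ratio never degenerates, which is presumably why the experiment uses a small positive step reward (0.001) rather than zero.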
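A toy driver matching the reported protocol (reward 1 on entering the goal, 0.001 otherwise, discount rate 0.9, averaging over 100 uniform random initial policies) might look as follows. It reuses evaluate_policy and rwr_step from the sketch above; the environment here is a plain open grid standing in for the four-room layout, and the grid size and iteration budget are illustrative, not taken from the paper.

```python
import numpy as np

def build_grid(n=5, goal=24):
    """Deterministic n x n grid, 4 actions; reward 1 on entering the (absorbing) goal."""
    S, A = n * n, 4
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    P = np.zeros((S, A, S))
    R = np.full((S, A), 0.001)
    for s in range(S):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            if s == goal:                         # goal is absorbing
                P[s, a, s] = 1.0
                continue
            nr = min(max(r + dr, 0), n - 1)
            nc = min(max(c + dc, 0), n - 1)
            t = nr * n + nc
            P[s, a, t] = 1.0
            if t == goal:
                R[s, a] = 1.0
    return P, R

gamma = 0.9
P, R = build_grid()
S, A = R.shape

# Optimal state values via value iteration, used as the RMSVE baseline.
v_star = np.zeros(S)
for _ in range(1000):
    v_star = np.max(R + gamma * P @ v_star, axis=1)

rng = np.random.default_rng(0)
errors = []
for _ in range(100):                              # 100 uniform random initial policies
    pi = rng.random((S, A))
    pi /= pi.sum(axis=1, keepdims=True)
    run = []
    for _ in range(50):                           # iteration budget is illustrative
        q, v = evaluate_policy(P, R, pi, gamma)
        run.append(np.sqrt(np.mean((v - v_star) ** 2)))   # RMSVE at this iteration
        pi = rwr_step(pi, q, v)
    errors.append(run)

print(np.mean(errors, axis=0)[[0, 10, 49]])       # average RMSVE early vs. late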