Reward-Weighted Regression Converges to a Global Optimum
Authors: Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber
AAAI 2022, pp. 8361-8369 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum. [...] Section 6 illustrates experimentally that for a simple MDP the presented update scheme converges to the optimal policy; [...] The bottom left of Figure 1 shows the root-mean-squared value error (RMSVE) of the learned policy at each iteration as compared to the optimal policy, while the bottom right shows the return obtained by the learned policy at each iteration. Smooth convergence can be observed under reward-weighted regression. |
| Researcher Affiliation | Collaboration | 1 The Swiss AI Lab IDSIA, Università della Svizzera italiana (USI) & SUPSI, Lugano, Switzerland 2 NNAISENSE, Lugano, Switzerland 3 King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia {struplm, francesco, dylan.ashley}@idsia.ch, rupesh@nnaisense.com, juergen@idsia.ch |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm. |
| Open Source Code | Yes | The source code for this experiment is available at https://github.com/dylanashley/reward-weighted-regression. |
| Open Datasets | Yes | In particular, we ensure that rewards are positive and that there is no function approximation for value functions and policies. In order to meet these criteria, we use the modified four-room gridworld domain (Sutton, Precup, and Singh 1999) shown on the left of Figure 1. |
| Dataset Splits | No | The paper describes the environment and how policy performance is measured (RMSVE, return), but does not provide specific training, validation, or test dataset splits. |
| Hardware Specification | No | The paper mentions support from various organizations and donations of hardware (e.g., "DGX-1", "Minsky machine"), but it does not explicitly state which specific hardware was used to run the experiments described in the paper. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In particular, we ensure that rewards are positive and that there is no function approximation for value functions and policies. In order to meet these criteria, we use the modified four-room gridworld domain (Sutton, Precup, and Singh 1999) shown on the left of Figure 1. [...] The agent receives a reward of 1 when transitioning from a non-goal state to the goal state and a reward of 0.001 otherwise. The discount rate is 0.9 at each step. [...] All lines are averages of 100 runs under different uniform random initial policies. (A minimal illustrative sketch of this setup and the RWR update follows the table.) |
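
The sketch below is not the authors' released code. It illustrates the tabular reward-weighted regression iteration the paper analyzes, run on a small open gridworld that stands in for the modified four-room domain. The 5x5 layout, goal position, sweep counts, and iteration count are illustrative assumptions; the positive rewards (1 on entering the goal, 0.001 otherwise), the 0.9 discount rate, the uniform random initial policy, and the RWR policy update without function approximation follow the paper's description.

```python
# Minimal tabular RWR sketch (assumptions noted above; not the authors' code).
import numpy as np

GRID, GAMMA = 5, 0.9                 # 5x5 stand-in grid; discount rate from the paper
S, A = GRID * GRID, 4                # states and actions (up, down, left, right)
GOAL = S - 1                         # bottom-right corner as the goal (assumption)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    """Deterministic transition; the goal is absorbing and rewards stay positive."""
    if s == GOAL:
        return s, 0.001
    r, c = divmod(s, GRID)
    dr, dc = MOVES[a]
    r = min(max(r + dr, 0), GRID - 1)
    c = min(max(c + dc, 0), GRID - 1)
    s2 = r * GRID + c
    return s2, (1.0 if s2 == GOAL else 0.001)

def evaluate(pi, sweeps=400):
    """Tabular policy evaluation: repeated Bellman backups converging to Q^pi."""
    q = np.zeros((S, A))
    for _ in range(sweeps):
        v = (pi * q).sum(axis=1)     # V^pi(s) = sum_a pi(a|s) Q^pi(s, a)
        for s in range(S):
            for a in range(A):
                s2, r = step(s, a)
                q[s, a] = r + GAMMA * v[s2]
    return q

pi = np.full((S, A), 1.0 / A)        # uniform random initial policy, as in the paper
for n in range(50):
    q = evaluate(pi)
    # RWR update without function approximation:
    # pi_{n+1}(a|s) = pi_n(a|s) Q^{pi_n}(s, a) / V^{pi_n}(s)
    pi = pi * q
    pi /= pi.sum(axis=1, keepdims=True)

print(pi.argmax(axis=1).reshape(GRID, GRID))   # greedy action per grid cell
```

Because all rewards are positive, every Q-value is strictly positive and the reweighting step is well defined; the paper's result then guarantees that repeating this update drives the policy to a global optimum, which the paper measures via the RMSVE against the optimal policy and the obtained return.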