Planning with Expectation Models
Authors: Yi Wan, Muhammad Zaheer, Adam White, Martha White, Richard S. Sutton
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 7 (Experiments): The goal of the experiment section is to validate the theoretical results and investigate how the Gradient Dyna algorithm performs in practice. |
| Researcher Affiliation | Academia | Yi Wan, Muhammad Zaheer, Adam White, Martha White and Richard S. Sutton. Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta. {wan6, mzaheer, amw8, whitem, rsutton}@ualberta.ca |
| Pseudocode | Yes | Algorithm 1 Gradient Dyna Algorithm |
| Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate the proposed method for the non-linear model choice in two simple yet illustrative domains: stochastic variants of Four Rooms [Sutton et al., 1999; Ghiassian et al., 2018] and Mountain Car [Sutton, 1996]. |
| Dataset Splits | No | The results are reported for hyperparameters chosen based on RMSE over the latter half of a run. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts, or detailed computer specifications) are provided. |
| Software Dependencies | No | No specific ancillary software details (library or solver names with version numbers) are provided. |
| Experiment Setup | Yes | For TD(0) with non-linear model and Gradient Dyna, we use a neural network with one hidden layer of 200 units as the non-linear model. We initialize the non-linear model using Xavier initialization [Glorot and Bengio, 2010]. The model is learned in an online fashion, that is, we use only the most recent sample to perform a gradient-descent update on the mean-square error. We used tile coding [Sutton, 1996] to obtain the feature vector (4 x 2 x 2 tilings). We again used tile coding to obtain the feature vector (8 x 8 x 8 tilings). We inject stochasticity in the environments by only executing the chosen action 70% of the time, whereas a random action is executed 30% of the time. |
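
The setup quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' code, and it fills in details the quoted text leaves open. The class name `OneHiddenLayerModel`, the helper `executed_action`, the tanh activation, the uniform variant of Xavier initialization, the step size, and the choice to let the random action coincide with the chosen one are all assumptions made for illustration. The loss is half the summed squared error, which differs from the mean-square error only by a constant factor absorbed into the step size.

```python
import math
import random


class OneHiddenLayerModel:
    """Hypothetical expectation model: feature vector in, predicted next feature vector out."""

    def __init__(self, n_in, n_out, n_hidden=200, step_size=0.01, rng=None):
        self.rng = rng or random.Random(0)
        self.step_size = step_size  # assumed value; not given in the quoted setup
        self.W1 = self._xavier(n_hidden, n_in)
        self.b1 = [0.0] * n_hidden
        self.W2 = self._xavier(n_out, n_hidden)
        self.b2 = [0.0] * n_out

    def _xavier(self, fan_out, fan_in):
        # Xavier (Glorot) uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
        a = math.sqrt(6.0 / (fan_in + fan_out))
        return [[self.rng.uniform(-a, a) for _ in range(fan_in)]
                for _ in range(fan_out)]

    def _forward(self, x):
        # tanh hidden units are an assumption; the source does not name the activation.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        y = [sum(w * hj for w, hj in zip(row, h)) + b
             for row, b in zip(self.W2, self.b2)]
        return h, y

    def predict(self, x):
        return self._forward(x)[1]

    def update(self, x, target):
        """One gradient-descent step using only the most recent sample.

        Returns the loss (half the summed squared error) measured before the step.
        """
        h, y = self._forward(x)
        err = [yk - tk for yk, tk in zip(y, target)]
        # Backpropagate through the output layer before its weights change.
        dh = [sum(err[k] * self.W2[k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
              for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.W2[k][j] -= self.step_size * e * hj
            self.b2[k] -= self.step_size * e
        for j, d in enumerate(dh):
            for i, xi in enumerate(x):
                self.W1[j][i] -= self.step_size * d * xi
            self.b1[j] -= self.step_size * d
        return 0.5 * sum(e * e for e in err)


def executed_action(chosen, n_actions, rng, p_chosen=0.7):
    """Execute the chosen action with probability p_chosen, else a uniformly random one.

    Assumption: the random action is drawn from all actions, including the chosen one.
    """
    if rng.random() < p_chosen:
        return chosen
    return rng.randrange(n_actions)
```

In an online loop, each environment step would call `executed_action` to decide what is actually executed and then call `update` once with the newly observed transition, with no replay buffer or batching.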