Planning with Expectation Models

Authors: Yi Wan, Muhammad Zaheer, Adam White, Martha White, Richard S. Sutton

IJCAI 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental (7 experiments) | "The goal of the experiment section is to validate the theoretical results and investigate how the Gradient Dyna algorithm performs in practice." |
| Researcher Affiliation | Academia | "Yi Wan, Muhammad Zaheer, Adam White, Martha White and Richard S. Sutton. Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta. {wan6, mzaheer, amw8, whitem, rsutton}@ualberta.ca" |
| Pseudocode | Yes | "Algorithm 1: Gradient Dyna Algorithm" |
| Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository. |
| Open Datasets | Yes | "We evaluate the proposed method for the non-linear model choice in two simple yet illustrative domains: stochastic variants of Four Rooms [Sutton et al., 1999; Ghiassian et al., 2018] and Mountain Car [Sutton, 1996]." |
| Dataset Splits | No | "The results are reported for hyperparameters chosen based on RMSE over the latter half of a run." |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts, or other computer specifications) are provided. |
| Software Dependencies | No | No specific ancillary software details (library or solver names with version numbers) are provided. |
| Experiment Setup | Yes | "For TD(0) with a non-linear model and Gradient Dyna, we use a neural network with one hidden layer of 200 units as the non-linear model. We initialize the non-linear model using Xavier initialization [Glorot and Bengio, 2010]. The model is learned in an online fashion; that is, we use only the most recent sample to perform a gradient-descent update on the mean-square error. We used tile coding [Sutton, 1996] to obtain feature vectors (4 × 2 × 2 tilings). We again used tile coding to obtain feature vectors (8 × 8 × 8 tilings). We inject stochasticity into the environments by executing the chosen action only 70% of the time; a random action is executed the remaining 30% of the time." |
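The Experiment Setup row describes the non-linear model concretely enough to sketch. The following is a minimal illustration under stated assumptions, not the paper's implementation: a one-hidden-layer network (200 units) with Xavier (Glorot) uniform initialization, trained online by a gradient-descent step on the mean-square error using only the most recent sample. The tanh nonlinearity, the step size, and all names here are illustrative choices not taken from the paper.

```python
import numpy as np

class OnlineNonlinearModel:
    """One-hidden-layer network trained online on mean-square error.

    Sketch of the model described in the experiment setup; the tanh
    hidden layer and the step size are illustrative assumptions."""

    def __init__(self, n_in, n_out, n_hidden=200, step_size=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Xavier/Glorot uniform init: U(-l, l) with l = sqrt(6 / (fan_in + fan_out))
        l1 = np.sqrt(6.0 / (n_in + n_hidden))
        l2 = np.sqrt(6.0 / (n_hidden + n_out))
        self.W1 = rng.uniform(-l1, l1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-l2, l2, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)
        self.alpha = step_size

    def predict(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return self.W2 @ h + self.b2

    def update(self, x, target):
        """One gradient-descent step on 0.5 * ||f(x) - target||^2,
        using only the most recent sample (online learning)."""
        h = np.tanh(self.W1 @ x + self.b1)
        y = self.W2 @ h + self.b2
        err = y - target                        # dLoss/dy
        dh = (self.W2.T @ err) * (1.0 - h * h)  # backprop through tanh
        self.W2 -= self.alpha * np.outer(err, h)
        self.b2 -= self.alpha * err
        self.W1 -= self.alpha * np.outer(dh, x)
        self.b1 -= self.alpha * dh
        return float(0.5 * err @ err)           # loss before this step
```

Because every update uses a single sample, no replay buffer or dataset split is involved, which is consistent with the "Dataset Splits: No" entry above.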
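The tile coding mentioned in the setup maps a continuous state to a sparse binary feature vector: several overlapping grid tilings, each offset slightly, contribute exactly one active feature apiece. A minimal sketch follows; it is not Sutton's tiles3 software, and the uniform-offset scheme, clipping, and argument names are assumptions for illustration.

```python
import numpy as np

def tile_code(state, lows, highs, tiles_per_dim, n_tilings):
    """Return a binary feature vector with one active tile per tiling.

    Each tiling partitions the normalized state space into a
    tiles_per_dim^d grid, shifted by a fraction of a tile width
    (a simple uniform-offset scheme, assumed for illustration)."""
    state = np.asarray(state, dtype=float)
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    scaled = (state - lows) / (highs - lows) * tiles_per_dim
    tiles_per_tiling = tiles_per_dim ** len(state)
    features = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        offset = t / n_tilings  # shift this tiling by a fraction of a tile
        coords = np.floor(scaled + offset).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        idx = 0
        for c in coords:  # row-major flat index of the active tile
            idx = idx * tiles_per_dim + int(c)
        features[t * tiles_per_tiling + idx] = 1.0
    return features
```

For a 2-D state such as Mountain Car's (position, velocity), an 8-tile-per-dimension grid with 8 tilings yields a 512-dimensional vector with exactly 8 active features.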
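The 70/30 action stochasticity described in the setup is straightforward to reproduce as an environment wrapper. In this sketch, `env_step`, `execute_stochastic`, and the parameter names are illustrative stand-ins, not identifiers from the paper.

```python
import random

def execute_stochastic(env_step, chosen_action, n_actions,
                       p_follow=0.7, rng=random):
    """Execute the agent's chosen action with probability p_follow (0.7
    in the paper's setup); otherwise execute a uniformly random action.
    Note the random draw may coincide with the chosen action."""
    if rng.random() < p_follow:
        action = chosen_action
    else:
        action = rng.randrange(n_actions)
    return env_step(action)
```

With 4 actions, the chosen action is actually executed with probability 0.7 + 0.3/4 = 0.775, since the random draw can land on it as well.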