Planning with Expectation Models

Authors: Yi Wan, Muhammad Zaheer, Adam White, Martha White, Richard S. Sutton

IJCAI 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental (7 experiments) | "The goal of the experiment section is to validate the theoretical results and investigate how the Gradient Dyna algorithm performs in practice." |
| Researcher Affiliation | Academia | "Yi Wan, Muhammad Zaheer, Adam White, Martha White and Richard S. Sutton. Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta. {wan6, mzaheer, amw8, whitem, rsutton}@ualberta.ca" |
| Pseudocode | Yes | "Algorithm 1: Gradient Dyna Algorithm" |
| Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository. |
| Open Datasets | Yes | "We evaluate the proposed method for the non-linear model choice in two simple yet illustrative domains: stochastic variants of Four Rooms [Sutton et al., 1999; Ghiassian et al., 2018] and Mountain Car [Sutton, 1996]." |
| Dataset Splits | No | "The results are reported for hyperparameters chosen based on RMSE over the latter half of a run." |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts, or other computer specifications) are provided. |
| Software Dependencies | No | No specific ancillary software details (library or solver names with version numbers) are provided. |
| Experiment Setup | Yes | "For TD(0) with a non-linear model and Gradient Dyna, we use a neural network with one hidden layer of 200 units as the non-linear model. We initialize the non-linear model using Xavier initialization [Glorot and Bengio, 2010]. The model is learned in an online fashion; that is, we use only the most recent sample to perform a gradient-descent update on the mean-square error. We used tile coding [Sutton, 1996] to obtain feature vectors (4 × 2 × 2 tilings). We again used tile coding to obtain feature vectors (8 × 8 × 8 tilings). We inject stochasticity into the environments by executing the chosen action only 70% of the time; a random action is executed the remaining 30% of the time." |
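The Experiment Setup row describes the non-linear model concretely enough to sketch. The following is a minimal illustration under stated assumptions, not the paper's implementation: a one-hidden-layer network (200 units) with Xavier (Glorot) uniform initialization, trained online by a gradient-descent step on the mean-square error using only the most recent sample. The tanh nonlinearity, the step size, and all names here are illustrative choices not taken from the paper.

```python
import numpy as np

class OnlineNonlinearModel:
    """One-hidden-layer network trained online on mean-square error.

    Sketch of the model described in the experiment setup; the tanh
    hidden layer and the step size are illustrative assumptions."""

    def __init__(self, n_in, n_out, n_hidden=200, step_size=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Xavier/Glorot uniform init: U(-l, l) with l = sqrt(6 / (fan_in + fan_out))
        l1 = np.sqrt(6.0 / (n_in + n_hidden))
        l2 = np.sqrt(6.0 / (n_hidden + n_out))
        self.W1 = rng.uniform(-l1, l1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-l2, l2, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)
        self.alpha = step_size

    def predict(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return self.W2 @ h + self.b2

    def update(self, x, target):
        """One gradient-descent step on 0.5 * ||f(x) - target||^2,
        using only the most recent sample (online learning)."""
        h = np.tanh(self.W1 @ x + self.b1)
        y = self.W2 @ h + self.b2
        err = y - target                        # dLoss/dy
        dh = (self.W2.T @ err) * (1.0 - h * h)  # backprop through tanh
        self.W2 -= self.alpha * np.outer(err, h)
        self.b2 -= self.alpha * err
        self.W1 -= self.alpha * np.outer(dh, x)
        self.b1 -= self.alpha * dh
        return float(0.5 * err @ err)           # loss before this step
```

Because every update uses a single sample, no replay buffer or dataset split is involved, which is consistent with the "Dataset Splits: No" entry above.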
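The tile coding mentioned in the setup maps a continuous state to a sparse binary feature vector: several overlapping grid tilings, each offset slightly, contribute exactly one active feature apiece. A minimal sketch follows; it is not Sutton's tiles3 software, and the uniform-offset scheme, clipping, and argument names are assumptions for illustration.

```python
import numpy as np

def tile_code(state, lows, highs, tiles_per_dim, n_tilings):
    """Return a binary feature vector with one active tile per tiling.

    Each tiling partitions the normalized state space into a
    tiles_per_dim^d grid, shifted by a fraction of a tile width
    (a simple uniform-offset scheme, assumed for illustration)."""
    state = np.asarray(state, dtype=float)
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    scaled = (state - lows) / (highs - lows) * tiles_per_dim
    tiles_per_tiling = tiles_per_dim ** len(state)
    features = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        offset = t / n_tilings  # shift this tiling by a fraction of a tile
        coords = np.floor(scaled + offset).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        idx = 0
        for c in coords:  # row-major flat index of the active tile
            idx = idx * tiles_per_dim + int(c)
        features[t * tiles_per_tiling + idx] = 1.0
    return features
```

For a 2-D state such as Mountain Car's (position, velocity), an 8-tile-per-dimension grid with 8 tilings yields a 512-dimensional vector with exactly 8 active features.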
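The 70/30 action stochasticity described in the setup is straightforward to reproduce as an environment wrapper. In this sketch, `env_step`, `execute_stochastic`, and the parameter names are illustrative stand-ins, not identifiers from the paper.

```python
import random

def execute_stochastic(env_step, chosen_action, n_actions,
                       p_follow=0.7, rng=random):
    """Execute the agent's chosen action with probability p_follow (0.7
    in the paper's setup); otherwise execute a uniformly random action.
    Note the random draw may coincide with the chosen action."""
    if rng.random() < p_follow:
        action = chosen_action
    else:
        action = rng.randrange(n_actions)
    return env_step(action)
```

With 4 actions, the chosen action is actually executed with probability 0.7 + 0.3/4 = 0.775, since the random draw can land on it as well.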