Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Planning with Expectation Models
Authors: Yi Wan, Muhammad Zaheer, Adam White, Martha White, Richard S. Sutton
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7 Experiments The goal of the experiment section is to validate the theoretical results and investigate how Gradient Dyna algorithm performs in practice. |
| Researcher Affiliation | Academia | Yi Wan, Muhammad Zaheer, Adam White, Martha White and Richard S. Sutton Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta EMAIL |
| Pseudocode | Yes | Algorithm 1 Gradient Dyna Algorithm |
| Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate the proposed method for the non-linear model choice in two simple yet illustrative domains: stochastic variants of Four Rooms [Sutton et al., 1999; Ghiassian et al., 2018] and Mountain Car [Sutton, 1996]. |
| Dataset Splits | No | The results are reported for hyperparameters chosen based on RMSE over the latter half of a run. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts, or detailed computer specifications) are provided. |
| Software Dependencies | No | No specific ancillary software details (library or solver names with version numbers) are provided. |
| Experiment Setup | Yes | For TD(0) with non-linear model and Gradient Dyna, we use a neural network with one hidden layer of 200 units as the non-linear model. We initialize the non-linear model using Xavier initialization [Glorot and Bengio, 2010]. The model is learned in an online fashion, that is, we use only the most recent sample to perform a gradient-descent update on the mean-square error. We used tile coding [Sutton, 1996] to obtain feature vector (4 x 2 x 2 tilings). We again used tile coding to obtain feature vector (8 x 8 x 8 tilings). We inject stochasticity in the environments by only executing the chosen action 70% of the time, whereas a random action is executed 30% of the time. |
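
No code is released with the paper, so the following is only a minimal NumPy sketch of the setup quoted in the Experiment Setup row: a model with one hidden layer of 200 units, Xavier initialization, an online gradient-descent update on the squared error of the most recent sample, and action noise that executes the chosen action 70% of the time. The class and function names, the tanh activation, the step size, and the choice to draw the random action uniformly over all actions (so it may coincide with the chosen one) are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np


def xavier_uniform(rng, fan_in, fan_out):
    """Xavier/Glorot uniform initialization for a (fan_in, fan_out) weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


class OneHiddenLayerModel:
    """Hypothetical model mapping a feature vector to a predicted target vector."""

    def __init__(self, n_in, n_out, n_hidden=200, step_size=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = xavier_uniform(rng, n_in, n_hidden)
        self.b1 = np.zeros(n_hidden)
        self.W2 = xavier_uniform(rng, n_hidden, n_out)
        self.b2 = np.zeros(n_out)
        self.step_size = step_size  # assumed value, not taken from the paper

    def predict(self, x):
        h = np.tanh(x @ self.W1 + self.b1)  # tanh is an assumed activation
        return h @ self.W2 + self.b2

    def update(self, x, target):
        """One gradient step on 0.5 * squared error for the most recent sample only."""
        h = np.tanh(x @ self.W1 + self.b1)
        err = (h @ self.W2 + self.b2) - target     # gradient w.r.t. the output
        grad_h = (self.W2 @ err) * (1.0 - h ** 2)  # backpropagate through tanh
        self.W2 -= self.step_size * np.outer(h, err)
        self.b2 -= self.step_size * err
        self.W1 -= self.step_size * np.outer(x, grad_h)
        self.b1 -= self.step_size * grad_h
        return 0.5 * float(err @ err)              # loss before the update


def executed_action(rng, chosen_action, n_actions, p_chosen=0.7):
    """Return the chosen action with probability p_chosen, else a uniform random action."""
    if rng.random() < p_chosen:
        return chosen_action
    return int(rng.integers(n_actions))
```

In an agent loop under these assumptions, `executed_action` would be called once per step before the environment transition, and `update` once per observed sample, with no replay buffer, which matches the "most recent sample" description in the table.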