Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Deeper Look at Planning as Learning from Replay
Authors: Harm van Seijen, Richard S. Sutton
ICML 2015 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the importance of multi-step models we applied this method to a small control problem with substantial function approximation. Whereas using a multi-step model resulted in fast convergence, the method using a one-step model failed to perform consistently. |
| Researcher Affiliation | Academia | Harm van Seijen EMAIL Richard S. Sutton EMAIL Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada |
| Pseudocode | Yes | Algorithm 1 Replaying TD(0) updates; Algorithm 2 Planning with the linear Dyna model; Algorithm 3 General Planning by Replay; Algorithm 4 replay; Algorithm 5 compute targets; Algorithm 6 update weights; Algorithm 7 Forgetful LSTD(λ) |
| Open Source Code | Yes | The code for this experiment can be found on https://github.com/vanseijen/singlestep-vs-multistep. |
| Open Datasets | Yes | To demonstrate the importance of multi-step models, we performed a comparison on the mountain car task (Sutton & Barto, 1998). |
| Dataset Splits | No | The paper describes the setup of the Mountain Car task and the number of episodes for evaluation, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts, as is common for fixed datasets. The data is generated dynamically through interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | The learning methods used are LS-Sarsa(λ) with λ = 0 and λ = 0.95. We used α = 0.01/3, k = 1 and θinit = 0, and ϵ-greedy exploration with ϵ = 0.01. In addition, we used the settings β = α, dinit = θinit/α and Ainit = I/α. |
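
The settings quoted in the Experiment Setup row can be written out as a minimal Python sketch, which may help when checking a reimplementation against the reported values. This is not the authors' code: the function names `init_state` and `epsilon_greedy` and the parameter `n_features` are hypothetical, and the feature dimension is an assumption because the table does not state it. Only the numeric values (α, β, k, ϵ) and the initialisation rules come from the row above.

```python
import numpy as np

# Values quoted in the Experiment Setup row.
ALPHA = 0.01 / 3   # step size, alpha = 0.01/3
BETA = ALPHA       # beta = alpha
K = 1              # k = 1
EPSILON = 0.01     # exploration rate for epsilon-greedy


def init_state(n_features):
    """Build theta_init, d_init and A_init from the quoted rules.

    n_features is a hypothetical parameter: the feature dimension
    is not given in the table.
    """
    theta = np.zeros(n_features)       # theta_init = 0
    d = theta / ALPHA                  # d_init = theta_init / alpha
    A = np.eye(n_features) / ALPHA     # A_init = I / alpha
    return theta, d, A


def epsilon_greedy(q_values, epsilon, rng):
    """Return a uniformly random action with probability epsilon,
    otherwise the index of the largest action value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With these initial values, `A @ theta` equals `d`, so the three quantities start out mutually consistent. A caller would typically create one generator with `np.random.default_rng(seed)` and pass it to `epsilon_greedy` at every step.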