A Deeper Look at Planning as Learning from Replay

Authors: Harm van Seijen, Richard S. Sutton

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the importance of multi-step models we applied this method to a small control problem with substantial function approximation. Whereas using a multi-step model resulted in fast convergence, the method using a one-step model failed to perform consistently.
Researcher Affiliation | Academia | Harm van Seijen (harm.vanseijen@ualberta.ca) and Richard S. Sutton (sutton@cs.ualberta.ca), Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada
Pseudocode | Yes | Algorithm 1: Replaying TD(0) updates; Algorithm 2: Planning with the linear Dyna model; Algorithm 3: General Planning by Replay; Algorithm 4: replay; Algorithm 5: compute targets; Algorithm 6: update weights; Algorithm 7: Forgetful LSTD(λ). (A minimal TD(0)-replay sketch appears after the table.)
Open Source Code | Yes | The code for this experiment can be found on https://github.com/vanseijen/singlestep-vs-multistep.
Open Datasets | Yes | To demonstrate the importance of a multi-step model, we performed a comparison on the mountain car task (Sutton & Barto, 1998).
Dataset Splits | No | The paper describes the setup of the Mountain Car task and the number of episodes used for evaluation, but it does not specify explicit training/validation/test splits with percentages or sample counts, as is common for fixed datasets; the data is generated dynamically through interaction with the environment.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers, such as Python 3.8 or PyTorch 1.9) needed to replicate the experiment.
Experiment Setup | Yes | The learning methods used are LS-Sarsa(λ) with λ = 0 and λ = 0.95. We used α = 0.01/3, k = 1 and θ_init = 0, and ϵ-greedy exploration with ϵ = 0.01. In addition, we used the settings β = α, d_init = θ_init/α and A_init = I/α. (A hedged configuration sketch follows the table.)
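
As a concrete reading of the Experiment Setup row, the quoted settings can be collected into a small configuration, sketched below in Python. Only the numerical values come from the quoted text; the dictionary layout, variable names, use of NumPy, and the n_features placeholder are assumptions for illustration.

```python
import numpy as np

# Hyperparameters as quoted in the "Experiment Setup" row. The dictionary
# layout and the n_features placeholder are illustrative assumptions, not
# details taken from the paper.
n_features = 10            # placeholder feature-vector length (assumption)
alpha = 0.01 / 3           # step size

setup = {
    "lambdas": [0.0, 0.95],                  # LS-Sarsa(lambda) with lambda = 0 and 0.95
    "alpha": alpha,
    "k": 1,
    "theta_init": np.zeros(n_features),      # theta_init = 0
    "epsilon": 0.01,                         # epsilon-greedy exploration
    "beta": alpha,                           # beta = alpha
    "d_init": np.zeros(n_features) / alpha,  # d_init = theta_init / alpha
    "A_init": np.eye(n_features) / alpha,    # A_init = I / alpha
}
```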
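
The Pseudocode row lists, among others, "Replaying TD(0) updates" (Algorithm 1 in the paper). Below is a minimal generic sketch of that idea: replaying stored transitions with semi-gradient TD(0) updates under linear function approximation. It is not a transcription of the paper's algorithm; the function name, signature, discounting, and number of replay sweeps are assumptions.

```python
import numpy as np

def replay_td0(transitions, n_features, alpha=0.01, gamma=1.0, n_sweeps=10):
    """Replay stored transitions with semi-gradient TD(0) updates under
    linear function approximation. A generic sketch, not the paper's
    Algorithm 1; the signature, discounting, and sweep count are assumptions.

    transitions: iterable of (phi, reward, phi_next, terminal) tuples,
    where phi and phi_next are feature vectors (np.ndarray).
    """
    theta = np.zeros(n_features)                 # weights, initialized to zero
    for _ in range(n_sweeps):                    # repeatedly replay the stored experience
        for phi, reward, phi_next, terminal in transitions:
            v = theta @ phi                      # current value estimate
            v_next = 0.0 if terminal else theta @ phi_next
            td_error = reward + gamma * v_next - v
            theta += alpha * td_error * phi      # TD(0) update toward the bootstrapped target
    return theta
```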