Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Deeper Look at Planning as Learning from Replay

Authors: Harm van Seijen, Richard S. Sutton

ICML 2015 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the importance of multi-step models we applied this method to a small control problem with substantial function approximation. Whereas using a multi-step model resulted in fast convergence, the method using a one-step model failed to perform consistently.
Researcher Affiliation | Academia | Harm van Seijen EMAIL Richard S. Sutton EMAIL Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada
Pseudocode | Yes | Algorithm 1 Replaying TD(0) updates; Algorithm 2 Planning with the linear Dyna model; Algorithm 3 General Planning by Replay; Algorithm 4 replay; Algorithm 5 compute targets; Algorithm 6 update weights; Algorithm 7 Forgetful LSTD(λ)
Open Source Code | Yes | The code for this experiment can be found on https://github.com/vanseijen/singlestep-vs-multistep.
Open Datasets | Yes | To demonstrate the importance of multi-step models, we performed a comparison on the mountain car task (Sutton & Barto, 1998).
Dataset Splits | No | The paper describes the setup of the Mountain Car task and the number of episodes for evaluation, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts, as is common for fixed datasets. The data is generated dynamically through interaction with the environment.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment.
Experiment Setup | Yes | The learning methods used are LS-Sarsa(λ) with λ = 0 and λ = 0.95. We used α = 0.01/3, k = 1 and θ_init = 0, and ϵ-greedy exploration with ϵ = 0.01. In addition, we used the settings β = α, d_init = θ_init/α and A_init = I/α.
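The pseudocode entries above include an algorithm for replaying TD(0) updates under linear function approximation. A minimal sketch of that idea is given below; it is not the paper's implementation. The transition encoding, the toy two-state chain, and the `sweeps` parameter are illustrative assumptions, while the step size α = 0.01/3 and the initialization θ_init = 0 follow the experiment settings quoted above.

```python
import numpy as np

def td0_replay(transitions, n_features, alpha=0.01 / 3, gamma=1.0, sweeps=2000):
    """Repeatedly replay stored (phi, r, phi_next) transitions with TD(0).

    Hedged sketch of replaying TD(0) updates with linear function
    approximation; `phi_next is None` marks a terminal transition.
    """
    theta = np.zeros(n_features)  # theta_init = 0, per the quoted settings
    for _ in range(sweeps):
        for phi, r, phi_next in transitions:
            v = theta @ phi
            v_next = 0.0 if phi_next is None else theta @ phi_next
            delta = r + gamma * v_next - v  # TD error
            theta += alpha * delta * phi    # TD(0) update on the weights
    return theta

# Toy two-state chain (illustrative, not Mountain Car):
# s0 -> s1 -> terminal, with reward 1 on the final step.
phi0, phi1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
transitions = [(phi0, 0.0, phi1), (phi1, 1.0, None)]
theta = td0_replay(transitions, n_features=2)
```

With γ = 1 both states have true value 1, so repeated replay drives both components of θ toward 1, illustrating how replaying the same stored experience converges to the TD fixed point.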