A Deeper Look at Planning as Learning from Replay
Authors: Harm van Seijen, Richard S. Sutton
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the importance of multi-step models we applied this method to a small control problem with substantial function approximation. Whereas using a multi-step model resulted in fast convergence, the method using a one-step model failed to perform consistently. |
| Researcher Affiliation | Academia | Harm van Seijen HARM.VANSEIJEN@UALBERTA.CA Richard S. Sutton SUTTON@CS.UALBERTA.CA Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada |
| Pseudocode | Yes | Algorithm 1 Replaying TD(0) updates; Algorithm 2 Planning with the linear Dyna model; Algorithm 3 General Planning by Replay; Algorithm 4 replay; Algorithm 5 compute targets; Algorithm 6 update weights; Algorithm 7 Forgetful LSTD(λ) (a hedged sketch of the TD(0) replay update follows the table) |
| Open Source Code | Yes | The code for this experiment can be found on https://github.com/vanseijen/singlestep-vs-multistep. |
| Open Datasets | Yes | To demonstrate the importance of multi-step models, we performed a comparison on the mountain car task (Sutton & Barto, 1998) |
| Dataset Splits | No | The paper describes the setup of the Mountain Car task and the number of episodes for evaluation, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts, as is common for fixed datasets. The data is generated dynamically through interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | The learning methods used are LS-Sarsa(λ) with λ = 0 and λ = 0.95. We used α = 0.01/3, k = 1 and θ_init = 0, and ε-greedy exploration with ε = 0.01. In addition, we used the settings β = α, d_init = θ_init/α and A_init = I/α. (a sketch of these settings follows the table) |
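
The pseudocode row above lists "Replaying TD(0) updates" as Algorithm 1. Below is a minimal Python sketch of that idea, assuming linear function approximation with feature vectors `phi` and a buffer of stored `(phi, reward, phi_next)` transitions; the function names and the usage snippet are illustrative assumptions, not the paper's exact algorithm or code.

```python
import numpy as np

def td0_update(theta, phi, reward, phi_next, alpha, gamma):
    """Apply one linear TD(0) update: theta <- theta + alpha * delta * phi."""
    delta = reward + gamma * np.dot(theta, phi_next) - np.dot(theta, phi)
    return theta + alpha * delta * phi

def replay_td0(theta, transitions, alpha=0.01, gamma=1.0, num_sweeps=10):
    """Repeatedly replay stored (phi, reward, phi_next) transitions with TD(0) updates.

    Replaying the same experience many times plays the role of planning,
    which is the connection the paper examines.
    """
    for _ in range(num_sweeps):
        for phi, reward, phi_next in transitions:
            theta = td0_update(theta, phi, reward, phi_next, alpha, gamma)
    return theta

# Illustrative usage with random features (not the paper's mountain-car setup).
rng = np.random.default_rng(0)
n_features = 8
transitions = [(rng.random(n_features), rng.random(), rng.random(n_features))
               for _ in range(50)]
theta = replay_td0(np.zeros(n_features), transitions)
```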
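The experiment-setup row quotes the hyperparameters used with LS-Sarsa(λ) on the mountain car task. The sketch below shows one way those settings could be wired up (ε-greedy action selection and the θ_init / d_init / A_init initialisation); the constants are taken from the quoted row, but the function names and overall structure are assumptions rather than the authors' implementation.

```python
import numpy as np

# Settings quoted in the setup row; how they are combined here is an assumption.
ALPHA = 0.01 / 3        # step size alpha
BETA = ALPHA            # beta = alpha
EPSILON = 0.01          # epsilon-greedy exploration rate
LAMBDAS = (0.0, 0.95)   # the two trace-decay values compared in the paper

def epsilon_greedy(q_values, epsilon=EPSILON, rng=None):
    """Choose a random action with probability epsilon, otherwise a greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def init_parameters(n_features, alpha=ALPHA, theta_init=0.0):
    """Initialise theta, d and A as stated: theta_init = 0, d_init = theta_init/alpha, A_init = I/alpha."""
    theta = np.full(n_features, theta_init)
    d = theta / alpha
    A = np.eye(n_features) / alpha
    return theta, d, A
```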