Multi-Step Reinforcement Learning: A Unifying Algorithm
Authors: Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically which can result in even greater performance. |
| Researcher Affiliation | Academia | Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton; Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta; {kldeasis,jfhernan,gholland,rsutton}@ualberta.ca |
| Pseudocode | Yes | Algorithm 1 Off-policy n-step Q(σ) for estimating q_π (a minimal update-rule sketch follows this table) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | No | The paper describes environments like '19-State Random Walk', 'Stochastic Windy Gridworld', and 'Mountain Cliff' which are simulated tasks based on existing RL literature (e.g., Sutton and Barto 1998). However, it does not provide concrete access information (links, DOIs, specific citations to public datasets with authors/years) for externally available, fixed datasets used in the experiments. The data is generated through agent interaction within these described environments. |
| Dataset Splits | No | The paper describes running simulations with episodes and runs (e.g., '1000 runs of 100 episodes each'), but it does not specify explicit train/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for fixed datasets. The experiments are conducted within dynamic reinforcement learning environments. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, cloud resources). |
| Software Dependencies | No | The paper mentions 'tile coding function approximation' and 'version 3 of Sutton's tile coding software (n.d.)'. Beyond this single, undated citation, no other software dependencies (languages, libraries, or their version numbers) needed for reproducibility are identified. |
| Experiment Setup | Yes | All instances of the algorithms behaved and learned according to an ϵ-greedy policy, with ϵ = 0.1. ... All training was done on-policy under an ϵ-greedy policy with ϵ = 0.1 and γ = 1. We optimized for the average return after 500 episodes over different values of the step size parameter, α, and the backup length, n. The results correspond to the best-performing parameter combination for each algorithm: α = 1/6 and n = 4 for Sarsa; α = 1/6 and n = 8 for Tree-backup; α = 1/4 and n = 4 for Q(0.5); and α = 1/7 and n = 8 for Dynamic σ. (These quoted settings are collected in the second sketch below.) |
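
The Pseudocode row above quotes Algorithm 1, the paper's off-policy n-step Q(σ). The following is a minimal sketch of the one-step, on-policy, tabular special case of that rule (behavior policy equals target policy, so importance-sampling ratios are 1), not a reproduction of Algorithm 1 itself. The function names, the NumPy 2-D array representation of Q, and the defaults (which mirror the quoted Q(0.5) settings of σ = 0.5, α = 1/4, γ = 1, ϵ = 0.1) are our assumptions.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon=0.1):
    """Action probabilities of an epsilon-greedy policy over one state's Q-values."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def q_sigma_update(Q, s, a, r, s_next, a_next, done,
                   sigma=0.5, alpha=0.25, gamma=1.0, epsilon=0.1):
    """One-step, on-policy Q(sigma) update on a tabular Q array of shape
    (n_states, n_actions). sigma = 1 recovers Sarsa's sampled backup;
    sigma = 0 recovers the expectation (Tree-backup / Expected Sarsa) target."""
    if done:
        target = r
    else:
        pi = epsilon_greedy_probs(Q[s_next], epsilon)
        sample_backup = Q[s_next, a_next]        # Sarsa-style sample of the next action
        expected_backup = np.dot(pi, Q[s_next])  # Tree-backup-style expectation over actions
        target = r + gamma * (sigma * sample_backup + (1.0 - sigma) * expected_backup)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

With σ strictly between 0 and 1 the target mixes the sampled and expected backups, which is the intermediate case the abstract (quoted in the Research Type row) reports as outperforming either extreme.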
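
The Experiment Setup row quotes the ϵ-greedy behavior policy and the best-performing parameter combinations. The sketch below simply collects those quoted settings; the dictionary layout, constant names, and the action-sampling helper are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Best-performing (alpha, n) combinations quoted in the Experiment Setup row.
BEST_PARAMS = {
    "Sarsa":       {"alpha": 1 / 6, "n": 4},
    "Tree-backup": {"alpha": 1 / 6, "n": 8},
    "Q(0.5)":      {"alpha": 1 / 4, "n": 4},
    "Dynamic σ":   {"alpha": 1 / 7, "n": 8},
}
EPSILON, GAMMA = 0.1, 1.0  # epsilon-greedy exploration and undiscounted returns, as quoted

def epsilon_greedy_action(q_row, rng, epsilon=EPSILON):
    """Sample an action epsilon-greedily from one state's Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```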