Multi-Step Reinforcement Learning: A Unifying Algorithm

Authors: Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton

AAAI 2018

Reproducibility assessment (variable, result, and supporting LLM response):

Variable: Research Type
Result: Experimental
LLM Response: Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically, which can result in even greater performance.

Variable: Researcher Affiliation
Result: Academia
LLM Response: Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, and Richard S. Sutton; Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta ({kldeasis,jfhernan,gholland,rsutton}@ualberta.ca).

Variable: Pseudocode
Result: Yes
LLM Response: Algorithm 1: Off-policy n-step Q(σ) for estimating qπ. (A hedged one-step sketch of the Q(σ) target appears after this table.)

Variable: Open Source Code
Result: No
LLM Response: The paper does not provide any explicit statement about releasing source code, nor a link to a code repository for the described methodology.

Variable: Open Datasets
Result: No
LLM Response: The paper describes environments such as the 19-State Random Walk, the Stochastic Windy Gridworld, and Mountain Cliff, which are simulated tasks drawn from the existing RL literature (e.g., Sutton and Barto 1998). It does not provide concrete access information (links, DOIs, or specific citations to public datasets with authors and years) for externally available, fixed datasets; the data are generated through agent interaction with these environments.

Variable: Dataset Splits
Result: No
LLM Response: The paper describes simulations in terms of episodes and runs (e.g., 1000 runs of 100 episodes each), but it does not specify explicit train/validation/test splits (percentages, sample counts, or citations to predefined splits) for fixed datasets. The experiments are conducted in dynamic reinforcement learning environments.

Variable: Hardware Specification
Result: No
LLM Response: The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud resources).

Variable: Software Dependencies
Result: No
LLM Response: The paper mentions tile coding function approximation and 'version 3 of Sutton's tile coding software (n.d.)'. Although this names the software and a version, the citation is undated, and no other software dependencies with specific version numbers are given.

Variable: Experiment Setup
Result: Yes
LLM Response: All instances of the algorithms behaved and learned according to an ϵ-greedy policy, with ϵ = 0.1. ... All training was done on-policy under an ϵ-greedy policy with ϵ = 0.1 and γ = 1. We optimized for the average return after 500 episodes over different values of the step size parameter, α, and the backup length, n. The results correspond to the best-performing parameter combination for each algorithm: α = 1/6 and n = 4 for Sarsa; α = 1/6 and n = 8 for Tree-backup; α = 1/4 and n = 4 for Q(0.5); and α = 1/7 and n = 8 for Dynamic σ. (A hedged code sketch of this setup follows the table.)
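For reference, below is a minimal sketch of the one-step Q(σ) backup target that Algorithm 1 generalizes to n steps off-policy. It assumes a tabular action-value array Q indexed as Q[state, action] and a target_policy_probs function returning action probabilities; these names and the one-step restriction are illustrative, not taken from the paper.

import numpy as np

def epsilon_greedy_probs(state, Q, epsilon=0.1):
    # epsilon-greedy action probabilities; in the paper's on-policy
    # experiments the target and behavior policies are both epsilon-greedy.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
    return probs

def q_sigma_target(Q, reward, next_state, next_action, sigma, gamma=1.0,
                   target_policy_probs=epsilon_greedy_probs):
    # One-step Q(sigma) backup target: a convex combination of the sampled
    # bootstrap term (Sarsa, sigma = 1) and the expected bootstrap term
    # (Expected Sarsa / one-step Tree-backup, sigma = 0); intermediate
    # sigma mixes the two, which is the regime the paper studies.
    expected_value = float(np.dot(target_policy_probs(next_state, Q), Q[next_state]))
    sampled_value = Q[next_state, next_action]
    return reward + gamma * (sigma * sampled_value + (1.0 - sigma) * expected_value)

The corresponding tabular update is Q[s, a] += alpha * (q_sigma_target(...) - Q[s, a]); the full algorithm in the paper extends this target over n steps, with a per-step σ and importance-sampling corrections for the off-policy case.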
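As a concrete illustration of the quoted setup, the sketch below collects the reported ϵ, γ, and best-performing (α, n) combinations into a configuration, together with ϵ-greedy action selection and an illustrative dynamic-σ schedule. The names (BEST_PARAMS, epsilon_greedy_action, dynamic_sigma) and the particular decay factor are assumptions for illustration, not taken from the paper.

import numpy as np

EPSILON = 0.1  # epsilon-greedy exploration, as quoted above
GAMMA = 1.0    # undiscounted returns, as quoted above

# Best-performing (alpha, n) combinations from the quoted setup. The sigma
# entries follow the paper's correspondence: Sarsa is sigma = 1, Tree-backup
# is sigma = 0, Q(0.5) is sigma = 0.5; Dynamic sigma varies sigma during training.
BEST_PARAMS = {
    "Sarsa":         {"alpha": 1 / 6, "n": 4, "sigma": 1.0},
    "Tree-backup":   {"alpha": 1 / 6, "n": 8, "sigma": 0.0},
    "Q(0.5)":        {"alpha": 1 / 4, "n": 4, "sigma": 0.5},
    "Dynamic sigma": {"alpha": 1 / 7, "n": 8, "sigma": None},
}

def epsilon_greedy_action(q_values, epsilon=EPSILON, rng=None):
    # With probability epsilon pick a uniformly random action,
    # otherwise pick a greedy one.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def dynamic_sigma(episode, decay=0.95):
    # Illustrative dynamic-sigma schedule: start with full sampling
    # (sigma = 1) and decay toward expected backups across episodes.
    # The decay factor is an assumption, not a value from the quoted text.
    return decay ** episode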