Average-Reward Learning and Planning with Options

Authors: Yi Wan, Abhishek Naik, Rich Sutton

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.
Researcher Affiliation | Collaboration | Yi Wan, Abhishek Naik, Richard S. Sutton ({wan6,anaik1,rsutton}@ualberta.ca); University of Alberta, Amii; DeepMind; Edmonton, Canada.
Pseudocode | No | The paper presents algorithmic equations (e.g., Equations 3-9 for inter-option Differential Q-learning) but no formal pseudocode blocks or algorithms labeled as such. (A rough, hypothetical sketch of such an update appears after this table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | No | The paper uses a "continuing version of the Four-Room domain", a simulated environment. It describes the environment and task setup but does not provide a link or access information for a publicly available, pre-collected dataset.
Dataset Splits | No | The paper does not specify train/validation/test dataset splits; it describes experimental runs in a simulated environment in which agents interact with the environment over a number of steps.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., library names and versions) needed to replicate the experiments.
Experiment Setup | Yes | For each of the two step-sizes α_n and β_n, we tested five choices: 2^{-x}, x ∈ {1, 3, 5, 7, 9}. In addition, we tested five choices of η: 10^{-x}, x ∈ {0, 1, 2, 3, 4}. Q and R were initialized to 0, L to 1. Each parameter setting was run for 200,000 steps and repeated 30 times. The agent used an ϵ-greedy policy with ϵ = 0.1. (A hypothetical sketch of this sweep appears below.)
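Since the paper states its updates as equations rather than pseudocode, the sketch below illustrates what an inter-option (SMDP-style) Differential Q-learning update could look like in the average-reward setting. The function name, variable names, and the exact form of the TD error are assumptions based on the paper's description, not a transcription of its Equations 3-9.

```python
import numpy as np

def inter_option_differential_q_update(Q, R_bar, s, o, cum_reward, duration,
                                        s_next, alpha=0.1, eta=0.1):
    """Hypothetical update applied when an option terminates.

    Q:          2-D array of option values, Q[state, option].
    R_bar:      scalar estimate of the reward rate (average reward per step).
    cum_reward: sum of rewards accumulated while option o executed from state s.
    duration:   number of environment steps the option took before ending in s_next.
    """
    # Differential TD error over the option's duration: the option's cumulative
    # reward is compared against R_bar * duration rather than being discounted.
    delta = cum_reward - R_bar * duration + np.max(Q[s_next]) - Q[s, o]
    Q[s, o] += alpha * delta        # option-value update
    R_bar += eta * alpha * delta    # reward-rate update, scaled by eta
    return Q, R_bar
```

The key average-reward ingredient is that the reward-rate estimate R_bar is both subtracted (per step of the option's duration) from the return and nudged by the same TD error, scaled by η.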
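The reported parameter sweep can likewise be summarized as a small grid. This is a minimal sketch assuming the step-sizes and η are negative powers of 2 and 10 respectively; the inner run loop is a placeholder, not the authors' code.

```python
from itertools import product

# Assumed hyperparameter grid, per the setup quoted above.
alphas = [2 ** -x for x in (1, 3, 5, 7, 9)]
betas = [2 ** -x for x in (1, 3, 5, 7, 9)]
etas = [10 ** -x for x in (0, 1, 2, 3, 4)]

NUM_RUNS = 30        # independent repetitions per parameter setting
NUM_STEPS = 200_000  # environment steps per run
EPSILON = 0.1        # epsilon-greedy exploration

for alpha, beta, eta in product(alphas, betas, etas):
    for seed in range(NUM_RUNS):
        # Placeholder: run the agent in the continuing Four-Room domain for
        # NUM_STEPS steps with an epsilon-greedy policy and record performance.
        pass
```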