Average-Reward Learning and Planning with Options

Authors: Yi Wan, Abhishek Naik, Rich Sutton

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.
Researcher Affiliation | Collaboration | Yi Wan, Abhishek Naik, Richard S. Sutton ({wan6,anaik1,rsutton}@ualberta.ca); University of Alberta, Amii; DeepMind; Edmonton, Canada.
Pseudocode | No | The paper presents algorithmic equations (e.g., Equations 3-9 for inter-option Differential Q-learning) but no formal pseudocode blocks or algorithms labeled as such. (A rough, hypothetical sketch of such an update appears after this table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | No | The paper uses a "continuing version of the Four-Room domain", a simulated environment. It describes the environment and task setup but does not provide a link or access information for a publicly available, pre-collected dataset.
Dataset Splits | No | The paper does not specify train/validation/test dataset splits; it describes experimental runs in a simulated environment in which agents interact with the environment over a number of steps.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., library names and versions) needed to replicate the experiments.
Experiment Setup | Yes | For each of the two step-sizes α_n and β_n, we tested five choices: 2^{-x}, x ∈ {1, 3, 5, 7, 9}. In addition, we tested five choices of η: 10^{-x}, x ∈ {0, 1, 2, 3, 4}. Q and R were initialized to 0, L to 1. Each parameter setting was run for 200,000 steps and repeated 30 times. The agent used an ϵ-greedy policy with ϵ = 0.1. (A hypothetical sketch of this sweep appears below.)
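Since the paper states its updates as equations rather than pseudocode, the sketch below illustrates what an inter-option (SMDP-style) Differential Q-learning update could look like in the average-reward setting. The function name, variable names, and the exact form of the TD error are assumptions based on the paper's description, not a transcription of its Equations 3-9.

```python
import numpy as np

def inter_option_differential_q_update(Q, R_bar, s, o, cum_reward, duration,
                                        s_next, alpha=0.1, eta=0.1):
    """Hypothetical update applied when an option terminates.

    Q:          2-D array of option values, Q[state, option].
    R_bar:      scalar estimate of the reward rate (average reward per step).
    cum_reward: sum of rewards accumulated while option o executed from state s.
    duration:   number of environment steps the option took before ending in s_next.
    """
    # Differential TD error over the option's duration: the option's cumulative
    # reward is compared against R_bar * duration rather than being discounted.
    delta = cum_reward - R_bar * duration + np.max(Q[s_next]) - Q[s, o]
    Q[s, o] += alpha * delta        # option-value update
    R_bar += eta * alpha * delta    # reward-rate update, scaled by eta
    return Q, R_bar
```

The key average-reward ingredient is that the reward-rate estimate R_bar is both subtracted (per step of the option's duration) from the return and nudged by the same TD error, scaled by η.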
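The reported parameter sweep can likewise be summarized as a small grid. This is a minimal sketch assuming the step-sizes and η are negative powers of 2 and 10 respectively; the inner run loop is a placeholder, not the authors' code.

```python
from itertools import product

# Assumed hyperparameter grid, per the setup quoted above.
alphas = [2 ** -x for x in (1, 3, 5, 7, 9)]
betas = [2 ** -x for x in (1, 3, 5, 7, 9)]
etas = [10 ** -x for x in (0, 1, 2, 3, 4)]

NUM_RUNS = 30        # independent repetitions per parameter setting
NUM_STEPS = 200_000  # environment steps per run
EPSILON = 0.1        # epsilon-greedy exploration

for alpha, beta, eta in product(alphas, betas, etas):
    for seed in range(NUM_RUNS):
        # Placeholder: run the agent in the continuing Four-Room domain for
        # NUM_STEPS steps with an epsilon-greedy policy and record performance.
        pass
```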