Average-Reward Learning and Planning with Options
Authors: Yi Wan, Abhishek Naik, Rich Sutton
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain. |
| Researcher Affiliation | Collaboration | Yi Wan, Abhishek Naik, Richard S. Sutton ({wan6,anaik1,rsutton}@ualberta.ca); University of Alberta and Amii, Edmonton, Canada; DeepMind, Edmonton, Canada |
| Pseudocode | No | The paper presents algorithmic equations (e.g., Equations 3-9 for inter-option Differential Q-learning) but no formal pseudocode blocks or explicitly labeled algorithms. A hedged sketch of that update is given after this table. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | The paper uses a "continuing version of the Four-Room domain", which is a simulated environment. It describes the environment and task setup but does not provide a link or specific access information for a publicly available, pre-collected dataset file. |
| Dataset Splits | No | The paper does not specify train/validation/test dataset splits. It describes experimental runs in a simulated environment, where agents interact with the environment over a number of steps. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library names with versions) needed to replicate the experiments. |
| Experiment Setup | Yes | For each of the two step sizes α_n and β_n, we tested five choices: 2^-x for x ∈ {1, 3, 5, 7, 9}. In addition, we tested five choices of η: 10^-x for x ∈ {0, 1, 2, 3, 4}. Q and R were initialized to 0, and L to 1. Each parameter setting was run for 200,000 steps and repeated 30 times. The agent used an ε-greedy policy with ε = 0.1. (This sweep is restated as a code sketch after the table.) |
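
As noted in the Pseudocode row, the paper states inter-option Differential Q-learning as equations rather than pseudocode. The sketch below shows the general shape such an update could take under the average-reward formulation of Wan, Naik, and Sutton: an option `o` started in state `s` runs for `tau` steps, accumulates reward `G`, and terminates in `s_next`; the TD error subtracts `tau` steps of the estimated reward rate. All names (`interoption_differential_q_step`, `alpha`, `eta`, the `options` argument) are illustrative, and the relation between the reward-rate step size and `eta * alpha` is an assumption; this is a hedged sketch, not the authors' exact algorithm.

```python
from collections import defaultdict

def interoption_differential_q_step(Q, r_bar, s, o, G, tau, s_next, options, alpha, eta):
    """One sketched inter-option Differential Q-learning update (tabular).

    Q       : mapping from (state, option) to value estimate
    r_bar   : scalar estimate of the average reward rate
    s, o    : state in which option o was initiated, and the option itself
    G       : cumulative reward accrued while o executed
    tau     : number of primitive steps o lasted
    s_next  : state in which o terminated
    options : options assumed available in s_next (illustrative)
    alpha   : value-update step size (alpha_n)
    eta     : reward-rate step-size parameter (assumed to scale alpha)
    """
    best_next = max(Q[(s_next, o2)] for o2 in options)
    # TD error: option-level reward minus tau steps of the estimated reward rate,
    # plus the best option value at termination, minus the current estimate.
    delta = G - tau * r_bar + best_next - Q[(s, o)]
    Q[(s, o)] += alpha * delta      # value update
    r_bar += eta * alpha * delta    # reward-rate update (assumed form)
    return r_bar

if __name__ == "__main__":
    Q = defaultdict(float)          # Q initialized to 0, as in the experiment setup
    r_bar = 0.0
    r_bar = interoption_differential_q_step(
        Q, r_bar, s=0, o="hallway", G=3.0, tau=5,
        s_next=1, options=["hallway"], alpha=2 ** -3, eta=0.1)
```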
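
The Experiment Setup row describes a small grid search; the snippet below restates it as a configuration sketch. The negative exponents (step sizes 2^-x, η = 10^-x) are an assumption consistent with step sizes below 1, and names such as `NUM_RUNS`, `NUM_STEPS`, and the placeholder loop body are illustrative.

```python
from itertools import product

# Assumed hyperparameter grid from the setup described above.
ALPHAS = [2 ** -x for x in (1, 3, 5, 7, 9)]    # value step size alpha_n
BETAS  = [2 ** -x for x in (1, 3, 5, 7, 9)]    # second step size beta_n
ETAS   = [10 ** -x for x in (0, 1, 2, 3, 4)]   # reward-rate step-size parameter eta

EPSILON   = 0.1       # epsilon-greedy behavior over options
NUM_STEPS = 200_000   # steps per run
NUM_RUNS  = 30        # independent repetitions per parameter setting

for alpha, beta, eta in product(ALPHAS, BETAS, ETAS):
    for run in range(NUM_RUNS):
        # Placeholder for one continuing Four-Room run with Q and R
        # initialized to 0 and L to 1, as stated in the table above.
        pass
```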