PAC-inspired Option Discovery in Lifelong Reinforcement Learning
Authors: Emma Brunskill, Lihong Li
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We make several contributions in this work. First, we present the first formal analysis of how options can improve the sample complexity of RL. This analysis provides a solid theoretical foundation to explain some of the prior empirical successes (and surprises). Second, we discuss the conditions under which we can discover a set of options during lifelong learning across a series of RL tasks that is guaranteed to reduce or match the sample complexity of single-task learning in future RL tasks. Third, we present a two-phase lifelong learning algorithm that performs option discovery and then leverages the prior options. Finally, this approach yields a significant performance improvement over RL without options on a prior benchmark simulation domain, as well as over a strong option-discovery baseline, and does almost as well as using a set of expert-constructed options. Under certain conditions, our discovered options are guaranteed not to lead to a particular form of negative transfer: using the discovered options will not result in a worse sample complexity bound for later tasks, compared to that with primitive actions. This is important because negative transfer, where prior knowledge harms future performance, is a common challenge in lifelong learning. |
| Researcher Affiliation | Collaboration | Emma Brunskill (EBRUNSKILL@CS.CMU.EDU), Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213; Lihong Li (LIHONGLI@MICROSOFT.COM), Microsoft Research, One Microsoft Way, Redmond, WA 98052 |
| Pseudocode | Yes | Algorithm 1 PAC-Inspired Option-Discovery. 1: Input: M, ϵ 2: while uncovered or remaining terminal (s, m) pairs do 3: Initialize new option o from one such state s 4: Assign to πo(s) the action that covers the most MDPs 5: while new option improves sample complexity do 6: Add successors s reachable under current option state(s) and policy 7: Let l be an assignment of ϵ-optimal actions to the successor set s 8: FoundImprovement = 0 9: while FoundImprovement == 0 and haven't tried all l do 10: sc(l) = sample complexity bound given l 11: if sc(l) improves over sc without option then 12: FoundImprovement = 1 13: Add s as leaves of o 14: else 15: Create a new assignment l 16: end if 17: end while 18: end while 19: Mark s as terminal of o, and add o to set of options 20: Update lists of uncovered (s, m) pairs and terminals 21: end while (a hedged Python sketch of this procedure follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We consider a non-episodic variant of the four-room maze of Sutton et al. (1999). |
| Dataset Splits | No | The paper describes a two-phase experimental setup where Phase 1 tasks are used for option discovery and Phase 2 tasks are used for evaluation. However, it does not specify explicit train/validation/test splits in the conventional sense for a single dataset, nor does it explicitly mention a 'validation set'. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions algorithms like SMDP-RMAX and E3 PAC-MDP, but it does not specify any software dependencies (e.g., programming languages, libraries, frameworks) or their version numbers. |
| Experiment Setup | Yes | The discount factor γ was set to 0.9 and in each task the agent acts for 200,000 steps. During phase 1 the agent executes the E3 PAC-MDP algorithm in each task and constructs an estimate of its optimal state-action values. After phase 1, the agent uses the resulting policies to perform option discovery... To further reduce computation, we bound the number of considered successors s at each stage of option construction. Our experiments in Section 6 used 4 as the threshold... We sampled 100 tasks with the structure just described, and report in Table 1 the average of the total rewards in these tasks for each of the algorithms... We thus manually searched, for each baseline algorithm, over c values, and report returns for the best possible settings. (a hedged sketch of this two-phase setup also follows the table) |
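
The flattened Algorithm 1 quoted in the Pseudocode row can be read as the greedy loop below. This is a minimal sketch, not the authors' implementation: the dict-based option representation and every caller-supplied helper (`pick_covering_action`, `successors`, `candidate_assignments`, `bound_with`, `covered_by`) are assumptions introduced for illustration, since the paper gives only the pseudocode quoted above.

```python
# Minimal sketch of Algorithm 1 (PAC-Inspired Option-Discovery).
# The option representation and every helper callable below are illustrative
# assumptions; the paper provides only the pseudocode quoted in the table.

def discover_options(
    state_mdp_pairs,        # (state, mdp) pairs left to cover from the phase-1 tasks
    pick_covering_action,   # state -> action that covers the most MDPs (step 4)
    successors,             # option -> frontier states reachable under the option (step 6)
    candidate_assignments,  # frontier -> iterable of eps-optimal action assignments (step 7)
    bound_with,             # (option, assignment) -> sample-complexity bound sc(l) (step 10)
    bound_without_option,   # scalar: sample-complexity bound with primitive actions only
    covered_by,             # option -> (state, mdp) pairs the finished option covers
):
    """Greedily grow options only while they improve the sample-complexity bound."""
    options = []
    uncovered = set(state_mdp_pairs)

    while uncovered:                                     # steps 2 and 21
        start_state, _ = next(iter(uncovered))
        option = {
            "init": start_state,                         # step 3: new option rooted at s
            "policy": {start_state: pick_covering_action(start_state)},  # step 4
            "terminal": set(),
        }

        improving = True
        while improving:                                 # step 5: grow while the bound improves
            frontier = successors(option)                # step 6: reachable successor states
            improving = False
            for assignment in candidate_assignments(frontier):  # steps 7, 9, 15: try assignments l
                if bound_with(option, assignment) < bound_without_option:  # steps 10-11
                    option["policy"].update(assignment)  # step 13: add successors as leaves of o
                    improving = True                     # step 12: improvement found
                    break

        option["terminal"] = successors(option)          # step 19: mark terminal states
        options.append(option)
        uncovered -= set(covered_by(option))             # step 20: update uncovered pairs
    return options
```

For the outer loop to terminate in this sketch, `covered_by` must return at least the (state, MDP) pair the option was started from; that detail is an assumption of the sketch, not something the quoted pseudocode pins down.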
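
The Experiment Setup row similarly maps onto a short two-phase driver. This is a sketch under stated assumptions: only the constants marked "from the paper" (γ = 0.9, 200,000 steps per task, 100 evaluation tasks, successor threshold of 4) come from the quoted text; the number of phase-1 tasks, the helper names (`sample_task`, `run_e3`, `run_smdp_learner`), and the simplified `option_discovery` signature are hypothetical placeholders.

```python
# Hedged sketch of the two-phase lifelong-learning experiment quoted above.
# Constants marked "from the paper" are taken from the Experiment Setup row;
# everything else (phase-1 task count, helper names and signatures) is assumed.

GAMMA = 0.9               # discount factor (from the paper)
STEPS_PER_TASK = 200_000  # steps the agent acts in each task (from the paper)
NUM_EVAL_TASKS = 100      # phase-2 tasks sampled for evaluation (from the paper)
MAX_SUCCESSORS = 4        # successors considered per option-growth step (from the paper)
NUM_PHASE1_TASKS = 10     # assumed placeholder; the table does not quote this number

def run_two_phase_experiment(sample_task, run_e3, option_discovery, run_smdp_learner):
    # Phase 1: run a PAC-MDP learner (E3) on each task and keep the resulting
    # policies / estimated optimal state-action values.
    phase1_policies = [
        run_e3(sample_task(), gamma=GAMMA, steps=STEPS_PER_TASK)
        for _ in range(NUM_PHASE1_TASKS)
    ]

    # Between phases: discover options from the phase-1 policies (see the
    # Algorithm 1 sketch above), bounding the successors considered per step.
    options = option_discovery(phase1_policies, max_successors=MAX_SUCCESSORS)

    # Phase 2: learn with the discovered options on fresh tasks and report
    # the average total reward over the sampled tasks.
    returns = [
        run_smdp_learner(sample_task(), options, gamma=GAMMA, steps=STEPS_PER_TASK)
        for _ in range(NUM_EVAL_TASKS)
    ]
    return sum(returns) / len(returns)
```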