Policy and Value Transfer in Lifelong Reinforcement Learning
Authors: David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, Michael Littman
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the relative performance of each policy class's optimal element in a variety of simple task distributions. We evaluate each algorithm empirically in a collection of simple lifelong RL tasks. |
| Researcher Affiliation | Academia | Department of Computer Science, Brown University, Providence, RI 02912. Correspondence to: David Abel <david_abel@brown.edu>, Yuu Jinnai <yuu_jinnai@brown.edu>. |
| Pseudocode | Yes | Algorithm 1 MAXQINIT (a hedged sketch of this initialization scheme appears below the table) |
| Open Source Code | Yes | Our code is freely available for reproducibility and extension: https://github.com/david-abel/transfer_rl_icml_2018 |
| Open Datasets | No | The paper describes custom grid-world environments and task distributions from which tasks are sampled (e.g., 'For R_D, we use a typical 11 × 11 grid world task distribution'). It does not refer to a pre-existing, publicly available dataset with a direct link, DOI, or formal citation. |
| Dataset Splits | No | The paper describes sampling tasks from a distribution and running agents for a number of steps or episodes, but does not define explicit training, validation, and test dataset splits in the traditional supervised learning sense. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computing environments) used for running its experiments. |
| Software Dependencies | No | The paper describes the algorithms and experimental settings but does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | For Q-Learning, we used ε-greedy action selection with ε = 0.1, and set the learning rate α = 0.1. For R-Max, we set the knownness threshold to 10, the number of experiences m of a state-action pair required before an update is allowed to 5, and the number of value iterations to 5; for Delayed-Q we set m = 5 and a constant exploration bonus ε1 = 0.1. In all experiments, the agent samples a goal uniformly at random and interacts with the resulting MDP for 100 episodes of 100 steps each. (An illustrative Q-learning sketch using these settings appears below the table.) |
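
The following is a minimal Python sketch of the idea behind Algorithm 1 (MAXQINIT): initialize the Q-function of a newly sampled task optimistically with V_max until enough tasks have been seen, and afterwards with the maximum Q-value observed across previously solved tasks. The names `prior_q_functions`, `num_tasks_needed`, and `v_max` are illustrative assumptions, not identifiers from the paper or the linked repository.

```python
def max_q_init(state_action_pairs, prior_q_functions, num_tasks_seen,
               num_tasks_needed, v_max):
    """Sketch of a MAXQINIT-style initializer for a new task's Q-table.

    prior_q_functions: list of dicts mapping (state, action) -> learned Q-value
    num_tasks_needed: number of sampled tasks required before the max over
        prior Q-functions is trusted as an optimistic initialization.
    """
    q_init = {}
    for (s, a) in state_action_pairs:
        if num_tasks_seen < num_tasks_needed:
            # Not enough tasks yet: fall back to the standard optimistic bound.
            q_init[(s, a)] = v_max
        else:
            # Tighter optimistic initialization: max over previously solved tasks.
            q_init[(s, a)] = max(q[(s, a)] for q in prior_q_functions)
    return q_init
```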
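
Below is an illustrative tabular Q-learning loop wired to the reported settings (ε = 0.1, α = 0.1, 100 episodes of 100 steps). The environment interface (`reset`, `step`, `actions`) and the discount factor `gamma` are assumptions for the sketch, not details taken from the paper or the authors' code; `q_init` is where a MAXQINIT-style initialization would plug in.

```python
import random
from collections import defaultdict

def q_learning(env, gamma=0.95, epsilon=0.1, alpha=0.1,
               episodes=100, steps=100, q_init=None):
    """Tabular Q-learning with epsilon-greedy exploration (sketch)."""
    q = defaultdict(float, q_init or {})
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            # Epsilon-greedy action selection with epsilon = 0.1.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: q[(s, act)])
            s_next, r, done = env.step(a)
            # One-step Q-learning update with learning rate alpha = 0.1.
            target = r + gamma * max(q[(s_next, act)] for act in env.actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
            if done:
                break
    return q
```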