Policy and Value Transfer in Lifelong Reinforcement Learning

Authors: David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, Michael Littman

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the relative performance of each policy class's optimal element in a variety of simple task distributions. We evaluate each algorithm empirically in a collection of simple lifelong RL tasks.
Researcher Affiliation | Academia | Department of Computer Science, Brown University, Providence, RI 02912. Correspondence to: David Abel <david_abel@brown.edu>, Yuu Jinnai <yuu_jinnai@brown.edu>.
Pseudocode | Yes | Algorithm 1 MAXQINIT (see the sketch below the table).
Open Source Code | Yes | Our code is freely available for reproducibility and extension: https://github.com/david-abel/transfer_rl_icml_2018
Open Datasets | No | The paper describes custom grid-world environments and task distributions from which tasks are sampled (e.g., 'For R_D, we use a typical 11x11 grid world task distribution'). It does not refer to a pre-existing, publicly available dataset with a direct link, DOI, or formal citation.
Dataset Splits | No | The paper describes sampling tasks from a distribution and running agents for a number of steps or episodes, but does not define explicit training, validation, and test dataset splits in the traditional supervised learning sense.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computing environments) used for running its experiments.
Software Dependencies | No | The paper describes the algorithms and experimental settings but does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | For Q-Learning, we used ε-greedy action selection with ε = 0.1 and a learning rate of α = 0.1. For R-Max, we set the knownness threshold to 10, the number of experiences m of a state-action pair required before an update to 5, and the number of value iterations to 5; for Delayed-Q we set m = 5 and a constant exploration bonus ε1 = 0.1. In all experiments, the agent samples a goal uniformly at random and interacts with the resulting MDP for 100 episodes of 100 steps each (summarized in the configuration sketch below).
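
The Pseudocode row refers to Algorithm 1 (MAXQINIT), which initializes the Q-table of a newly sampled task with the per-state-action maximum of the Q-values learned on previously sampled tasks. Below is a minimal Python sketch of that idea in a tabular setting; the names max_q_init, q_tables, and default_q are illustrative rather than taken from the authors' released code, and the sample-count condition the paper uses to justify optimism is omitted.

```python
from collections import defaultdict

def max_q_init(q_tables, states, actions, default_q=1.0):
    """Build an initial Q-table for a new task as the per-(state, action)
    maximum over Q-tables learned on previously sampled tasks.
    Before any task has been seen, fall back to an optimistic default."""
    q_init = defaultdict(lambda: defaultdict(lambda: default_q))
    if not q_tables:
        return q_init
    for s in states:
        for a in actions:
            # The max over previously learned tasks keeps the estimate
            # optimistic for the new task, which is the property the
            # MaxQInit analysis relies on.
            q_init[s][a] = max(q[s][a] for q in q_tables)
    return q_init

# Example: two previously learned tables over a toy state/action space.
states, actions = ["s0", "s1"], ["left", "right"]
prev = [
    {"s0": {"left": 0.2, "right": 0.5}, "s1": {"left": 0.1, "right": 0.9}},
    {"s0": {"left": 0.4, "right": 0.3}, "s1": {"left": 0.6, "right": 0.2}},
]
q0 = max_q_init(prev, states, actions)
print(q0["s0"]["right"])  # 0.5
```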
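
For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. This is only an illustrative summary of the reported values; the key names are assumptions and do not reflect the authors' configuration format.

```python
# Illustrative summary of the reported hyperparameters; key names are assumptions.
EXPERIMENT_CONFIG = {
    "q_learning": {"epsilon": 0.1, "alpha": 0.1},  # ε-greedy exploration, learning rate
    "r_max": {
        "knownness_threshold": 10,
        "experiences_before_update_m": 5,
        "value_iterations": 5,
    },
    "delayed_q": {"m": 5, "epsilon_1": 0.1},  # ε1 is the constant exploration bonus
    "episodes_per_task": 100,
    "steps_per_episode": 100,
}
```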