Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Policy and Value Transfer in Lifelong Reinforcement Learning

Authors: David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, Michael Littman

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically demonstrate the relative performance of each policy class optimal element in a variety of simple task distributions. We evaluate each algorithm empirically in a collection of simple lifelong RL tasks."
Researcher Affiliation | Academia | "Department of Computer Science, Brown University, Providence, RI 02912. Correspondence to: David Abel <david EMAIL>, Yuu Jinnai <yuu EMAIL>."
Pseudocode | Yes | Algorithm 1: MAXQINIT
Open Source Code | Yes | "Our code is freely available for reproducibility and extension." https://github.com/david-abel/transfer_rl_icml_2018
Open Datasets | No | The paper describes custom grid-world environments and task distributions from which tasks are sampled (e.g., "For R_D, we use a typical 11×11 grid world task distribution"). It does not refer to a pre-existing, publicly available dataset with a direct link, DOI, or formal citation.
Dataset Splits | No | The paper describes sampling tasks from a distribution and running agents for a number of steps or episodes, but does not define explicit training, validation, and test splits in the traditional supervised-learning sense.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or computing environment) used to run its experiments.
Software Dependencies | No | The paper describes the algorithms and experimental settings but does not list software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | "For Q-Learning, we used ε-greedy action selection with ε = 0.1, and set learning rate α = 0.1. For R-Max, we set the knownness threshold to 10, the number of experiences m of a state-action pair required before an update to 5, and the number of value iterations to 5; for Delayed-Q we set m = 5 and a constant exploration bonus ε₁ = 0.1. In all experiments, the agent samples a goal uniformly at random and interacts with the resulting MDP for 100 episodes with 100 steps per episode."
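The Pseudocode row above notes the paper's Algorithm 1, MAXQINIT, which (per the paper's abstract idea) initializes a new task's Q-values from Q-values learned on previously sampled tasks. The sketch below is a hedged illustration of that initialization idea only; the function name, data layout, and `v_max` fallback are assumptions, not the paper's implementation:

```python
# Illustrative sketch of the MaxQInit idea: initialize a new task's Q-table
# as the elementwise maximum over Q-tables learned on previous tasks.
# All names here are hypothetical, not taken from the paper's code.

def max_q_init(previous_q_tables, states, actions, v_max):
    """Build an optimistic initial Q-table for a new task.

    previous_q_tables: list of dicts mapping (state, action) -> Q-value,
        one dict per previously solved task.
    v_max: optimistic upper bound used when no prior tasks exist.
    """
    q_init = {}
    for s in states:
        for a in actions:
            if previous_q_tables:
                # Max over prior tasks keeps the estimate optimistic
                # while tightening it as more tasks are seen.
                q_init[(s, a)] = max(q[(s, a)] for q in previous_q_tables)
            else:
                q_init[(s, a)] = v_max  # fall back to pure optimism
    return q_init
```

On the first task the table reduces to optimistic initialization at `v_max`; as tasks accumulate, the max over prior solutions gives a tighter but still admissible starting point.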
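The quoted Q-learning settings (ε = 0.1, α = 0.1, 100 episodes of 100 steps) can be sketched as a minimal tabular loop. The environment interface (`reset`/`step`) and the discount factor γ = 0.95 are assumptions for illustration, not details from the paper:

```python
import random

# Minimal tabular Q-learning with epsilon-greedy action selection,
# using the hyperparameters quoted in the Experiment Setup row.
# The env object is a hypothetical stand-in with reset() -> state
# and step(action) -> (next_state, reward, done).

def q_learning(env, states, actions, episodes=100, steps=100,
               epsilon=0.1, alpha=0.1, gamma=0.95):
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            # Epsilon-greedy: explore with probability epsilon.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q[(s, a_)])
            s2, r, done = env.step(a)
            # One-step TD target; no bootstrap on terminal states.
            target = r if done else r + gamma * max(
                q[(s2, a_)] for a_ in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
            if done:
                break
    return q
```

The paper's lifelong setting would wrap this loop in an outer loop that samples a goal (i.e., an MDP) uniformly at random before each run, per the quoted setup.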