Policy and Value Transfer in Lifelong Reinforcement Learning

Authors: David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, Michael Littman

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the relative performance of each policy class's optimal element in a variety of simple task distributions. We evaluate each algorithm empirically in a collection of simple lifelong RL tasks.
Researcher Affiliation | Academia | Department of Computer Science, Brown University, Providence, RI 02912. Correspondence to: David Abel <david_abel@brown.edu>, Yuu Jinnai <yuu_jinnai@brown.edu>.
Pseudocode | Yes | Algorithm 1 MAXQINIT (see the sketch below the table).
Open Source Code | Yes | Our code is freely available for reproducibility and extension: https://github.com/david-abel/transfer_rl_icml_2018
Open Datasets | No | The paper describes custom grid-world environments and task distributions from which tasks are sampled (e.g., 'For R_D, we use a typical 11x11 grid world task distribution'). It does not refer to a pre-existing, publicly available dataset with a direct link, DOI, or formal citation.
Dataset Splits | No | The paper describes sampling tasks from a distribution and running agents for a number of steps or episodes, but does not define explicit training, validation, and test dataset splits in the traditional supervised learning sense.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computing environments) used for running its experiments.
Software Dependencies | No | The paper describes the algorithms and experimental settings but does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | For Q-Learning, we used ε-greedy action selection with ε = 0.1 and a learning rate of α = 0.1. For R-Max, we set the knownness threshold to 10, the number of experiences m of a state-action pair required before an update to 5, and the number of value iterations to 5; for Delayed-Q we set m = 5 and a constant exploration bonus ε1 = 0.1. In all experiments, the agent samples a goal uniformly at random and interacts with the resulting MDP for 100 episodes of 100 steps each (summarized in the configuration sketch below).
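
The Pseudocode row refers to Algorithm 1 (MAXQINIT), which initializes the Q-table of a newly sampled task with the per-state-action maximum of the Q-values learned on previously sampled tasks. Below is a minimal Python sketch of that idea in a tabular setting; the names max_q_init, q_tables, and default_q are illustrative rather than taken from the authors' released code, and the sample-count condition the paper uses to justify optimism is omitted.

```python
from collections import defaultdict

def max_q_init(q_tables, states, actions, default_q=1.0):
    """Build an initial Q-table for a new task as the per-(state, action)
    maximum over Q-tables learned on previously sampled tasks.
    Before any task has been seen, fall back to an optimistic default."""
    q_init = defaultdict(lambda: defaultdict(lambda: default_q))
    if not q_tables:
        return q_init
    for s in states:
        for a in actions:
            # The max over previously learned tasks keeps the estimate
            # optimistic for the new task, which is the property the
            # MaxQInit analysis relies on.
            q_init[s][a] = max(q[s][a] for q in q_tables)
    return q_init

# Example: two previously learned tables over a toy state/action space.
states, actions = ["s0", "s1"], ["left", "right"]
prev = [
    {"s0": {"left": 0.2, "right": 0.5}, "s1": {"left": 0.1, "right": 0.9}},
    {"s0": {"left": 0.4, "right": 0.3}, "s1": {"left": 0.6, "right": 0.2}},
]
q0 = max_q_init(prev, states, actions)
print(q0["s0"]["right"])  # 0.5
```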
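
For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. This is only an illustrative summary of the reported values; the key names are assumptions and do not reflect the authors' configuration format.

```python
# Illustrative summary of the reported hyperparameters; key names are assumptions.
EXPERIMENT_CONFIG = {
    "q_learning": {"epsilon": 0.1, "alpha": 0.1},  # ε-greedy exploration, learning rate
    "r_max": {
        "knownness_threshold": 10,
        "experiences_before_update_m": 5,
        "value_iterations": 5,
    },
    "delayed_q": {"m": 5, "epsilon_1": 0.1},  # ε1 is the constant exploration bonus
    "episodes_per_task": 100,
    "steps_per_episode": 100,
}
```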