TempLe: Learning Template of Transitions for Sample Efficient Multi-task RL

Authors: Yanchao Sun, Xiangyu Yin, Furong Huang (pp. 9765-9773)

AAAI 2021

Reproducibility Variable / Result / LLM Response

- Research Type: Experimental. "In this section, we demonstrate empirical results to show O-TempLe and FM-TempLe outperform existing state-of-the-art algorithms both in the finite-model setting and in the more realistic online setting."
- Researcher Affiliation: Academia. "1 University of Maryland, College Park, MD, 20742; 2 Beijing University of Posts and Telecommunications, China. ycs@umd.edu, yinxiangyu@bupt.edu.cn, furongh@umd.edu"
- Pseudocode: Yes. "Algorithm 1 Online Template Learning (O-TempLe)" and "Algorithm 2 TT Functions"
- Open Source Code: Yes. "Our code is available at https://github.com/umd-huang-lab/templatereinforcement-learning."
- Open Datasets: No. The paper describes generating environments (e.g., maze tasks) and sampling tasks from them, but provides no concrete access information or citations for a publicly available dataset.
- Dataset Splits: No. The paper discusses reinforcement learning tasks and environments and does not specify the train/validation/test splits needed for reproducibility; tasks are generated or drawn from a set of MDPs.
- Hardware Specification: No. The paper does not provide specific hardware details (such as exact GPU/CPU models or cloud instance types) used for its experiments.
- Software Dependencies: No. The paper mentions RMax and Q-learning as base learners or baselines, but gives no version numbers for ancillary software dependencies (e.g., Python version, deep learning framework versions).
- Experiment Setup: Yes. "Input: user-specified TT gap τ̂; error tolerance ε; discount factor γ; regular known threshold m; small known threshold ms. Environment: for the more realistic Online MTRL, which allows the number of MDP models to be extremely large, we generalize the traditional maze environment to have arbitrary combinations of landforms, as shown in Figure 1. We use 3 types of landforms (sand, marble, and ice), with slipping probabilities 0, 0.2, and 0.4 respectively. We test various hyper-parameters to understand how significantly the performance of the algorithms could be affected by inaccurate guesses of τ̂ and Γ, shown in Figure 3."
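The setup above reports per-landform slipping probabilities (sand 0, marble 0.2, ice 0.4) but the excerpt does not spell out the exact transition dynamics. The sketch below is a minimal gridworld under the assumed convention that a "slip" moves the agent perpendicular to the intended direction; the class and method names (`LandformMaze`, `step`) are hypothetical, not from the paper's code.

```python
import random

# Slipping probability per landform, as reported in the paper's setup.
SLIP = {"sand": 0.0, "marble": 0.2, "ice": 0.4}

# Action name -> (row delta, column delta).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class LandformMaze:
    """Toy maze whose cells carry landforms; slip dynamics are assumed:
    with the current cell's slip probability, the agent moves along a
    random perpendicular direction instead of the intended one."""

    def __init__(self, landforms, goal, seed=0):
        self.landforms = landforms            # 2D list of landform names
        self.rows = len(landforms)
        self.cols = len(landforms[0])
        self.goal = goal                      # goal cell (row, col)
        self.rng = random.Random(seed)

    def step(self, state, action):
        r, c = state
        dr, dc = ACTIONS[action]
        if self.rng.random() < SLIP[self.landforms[r][c]]:
            # Slip: replace the move with one of the two perpendicular moves.
            dr, dc = self.rng.choice([(dc, dr), (-dc, -dr)])
        # Clamp to the grid so the agent cannot leave the maze.
        nr = min(max(r + dr, 0), self.rows - 1)
        nc = min(max(c + dc, 0), self.cols - 1)
        next_state = (nr, nc)
        reward = 1.0 if next_state == self.goal else 0.0
        return next_state, reward


maze = LandformMaze([["sand", "ice"], ["marble", "sand"]], goal=(1, 1))
# Sand never slips, so this transition is deterministic.
state, reward = maze.step((0, 0), "right")
```

Because sand has slip probability 0, moves from sand cells are deterministic, which makes small checks like the one above easy to reason about; marble and ice cells produce stochastic transitions through the seeded `random.Random` instance.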