Learning Reward Machines for Partially Observable Reinforcement Learning

Authors: Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.
Researcher Affiliation | Collaboration | Rodrigo Toro Icarte (University of Toronto, Vector Institute); Ethan Waldie (University of Toronto); Toryn Q. Klassen (University of Toronto, Vector Institute); Richard Valenzano (Element AI); Margarita P. Castro (University of Toronto); Sheila A. McIlraith (University of Toronto, Vector Institute)
Pseudocode | Yes | The complete pseudo-code can be found in the supplementary material (Algorithm 1).
Open Source Code | Yes | Our code is available at https://bitbucket.org/RToroIcarte/lrm.
Open Datasets | No | We tested our approach on three partially observable grid domains (Figure 1).
Dataset Splits | No | Figure 3 shows the total cumulative rewards that each approach gets every 10,000 training steps.
Hardware Specification | Yes | Given a training set composed of one million transitions, a simple Python implementation of Tabu search takes less than 2.5 minutes to learn an RM across all our environments, when using 62 workers on a Threadripper 2990WX processor.
Software Dependencies | No | We compared against 4 baselines: DDQN [35], A3C [24], ACER [37], and PPO [29] using the OpenAI baselines implementations [12].
Experiment Setup | Yes | In all domains, we used u_max = 10, t_w = 200,000, an epsilon-greedy policy with ε = 0.1, and a discount factor γ = 0.9. Both the size of the Tabu list and the number of steps that the Tabu search performs before returning the best RM found are 100.
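The Experiment Setup and Hardware Specification rows above report the hyperparameters and note that a simple Python implementation of Tabu search learns an RM from one million transitions in under 2.5 minutes. The sketch below is only an illustration of that setup, not the authors' code: the constant names are ours, and the neighbors/cost callbacks are placeholders for the RM-specific neighborhood and learning objective given in the paper's supplementary Algorithm 1; only the numeric values come from the table.

```python
# Hedged sketch: illustrative constants and a generic Tabu search loop.
# Only the numeric values are taken from the reported experiment setup.

U_MAX = 10               # u_max reported in the Experiment Setup row
T_W = 200_000            # t_w reported in the Experiment Setup row
EPSILON = 0.1            # epsilon-greedy exploration rate
GAMMA = 0.9              # discount factor
TABU_LIST_SIZE = 100     # reported size of the Tabu list
TABU_SEARCH_STEPS = 100  # reported steps before returning the best RM found


def tabu_search(initial, neighbors, cost,
                n_steps=TABU_SEARCH_STEPS, tabu_size=TABU_LIST_SIZE):
    """Generic Tabu search loop: move to the best non-Tabu neighbor each step,
    keep a bounded list of recently visited candidates, and return the best
    candidate seen overall."""
    current, best = initial, initial
    best_cost = cost(best)
    tabu = [initial]
    for _ in range(n_steps):
        candidates = [c for c in neighbors(current) if c not in tabu]
        if not candidates:
            break  # neighborhood exhausted by the Tabu list
        current = min(candidates, key=cost)
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)  # drop the oldest entry to keep the list bounded
        current_cost = cost(current)
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best
```

In the authors' pipeline, cost would score a candidate RM against the collected traces and neighbors would propose local modifications to an RM with at most u_max states; both are left abstract here.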