Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Reward Machines for Partially Observable Reinforcement Learning
Authors: Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential. |
| Researcher Affiliation | Collaboration | Rodrigo Toro Icarte (University of Toronto, Vector Institute); Ethan Waldie (University of Toronto); Toryn Q. Klassen (University of Toronto, Vector Institute); Richard Valenzano (Element AI); Margarita P. Castro (University of Toronto); Sheila A. McIlraith (University of Toronto, Vector Institute) |
| Pseudocode | Yes | The complete pseudo-code can be found in the supplementary material (Algorithm 1). |
| Open Source Code | Yes | Our code is available at https://bitbucket.org/RToroIcarte/lrm. |
| Open Datasets | No | We tested our approach on three partially observable grid domains (Figure 1). |
| Dataset Splits | No | Figure 3 shows the total cumulative rewards that each approach gets every 10,000 training steps. |
| Hardware Specification | Yes | Given a training set composed of one million transitions, a simple Python implementation of Tabu search takes less than 2.5 minutes to learn an RM across all our environments, when using 62 workers on a Threadripper 2990WX processor. |
| Software Dependencies | No | We compared against 4 baselines: DDQN [35], A3C [24], ACER [37], and PPO [29] using the Open AI baseline implementations [12]. |
| Experiment Setup | Yes | In all domains, we used u_max = 10, t_w = 200,000, an epsilon-greedy policy with ϵ = 0.1, and a discount factor γ = 0.9. The size of the Tabu list and the number of steps that the Tabu search performs before returning the best RM found are both 100. |
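To make the quoted hyperparameters concrete, here is a minimal sketch of the epsilon-greedy action selection they describe (ϵ = 0.1, γ = 0.9). This is an illustrative reconstruction, not the authors' implementation; the function name `epsilon_greedy` and the list-of-Q-values interface are assumptions.

```python
import random

# Hyperparameters quoted in the experiment setup.
# u_max and the Tabu-search settings govern reward-machine learning;
# EPSILON and GAMMA govern the RL policy.
EPSILON = 0.1  # exploration rate of the epsilon-greedy policy
GAMMA = 0.9    # discount factor (used in the Q-value update, not shown here)

def epsilon_greedy(q_values, epsilon=EPSILON, rng=random):
    """With probability epsilon pick a random action index,
    otherwise pick the index of the largest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With ϵ = 0.1, roughly one action in ten is exploratory; the rest follow the current greedy estimate.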