Learning Reward Machines for Partially Observable Reinforcement Learning
Authors: Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential. |
| Researcher Affiliation | Collaboration | Rodrigo Toro Icarte (University of Toronto, Vector Institute); Ethan Waldie (University of Toronto); Toryn Q. Klassen (University of Toronto, Vector Institute); Richard Valenzano (Element AI); Margarita P. Castro (University of Toronto); Sheila A. McIlraith (University of Toronto, Vector Institute) |
| Pseudocode | Yes | The complete pseudo-code can be found in the supplementary material (Algorithm 1). |
| Open Source Code | Yes | Our code is available at https://bitbucket.org/RToroIcarte/lrm. |
| Open Datasets | No | We tested our approach on three partially observable grid domains (Figure 1). |
| Dataset Splits | No | Figure 3 shows the total cumulative rewards that each approach gets every 10,000 training steps. |
| Hardware Specification | Yes | Given a training set composed of one million transitions, a simple Python implementation of Tabu search takes less than 2.5 minutes to learn an RM across all our environments, when using 62 workers on a Threadripper 2990WX processor. |
| Software Dependencies | No | We compared against 4 baselines: DDQN [35], A3C [24], ACER [37], and PPO [29] using the OpenAI baselines implementations [12]. |
| Experiment Setup | Yes | In all domains, we used u_max = 10, t_w = 200,000, an epsilon-greedy policy with ϵ = 0.1, and a discount factor γ = 0.9. The size of the Tabu list and the number of steps that the Tabu search performs before returning the best RM found are both 100. |
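
The experiment-setup row above fixes the hyperparameters shared across all domains. Purely as an illustration, they could be collected into a single configuration block as sketched below; the variable names are assumptions chosen for readability and are not taken from the authors' released code (see the paper for the precise definitions of u_max and t_w).

```python
# Hypothetical configuration mirroring the reported setup; the key names are
# illustrative only and do not come from https://bitbucket.org/RToroIcarte/lrm.
LRM_CONFIG = {
    "u_max": 10,               # reported value of u_max (see the paper for its definition)
    "t_w": 200_000,            # reported value of t_w (see the paper for its definition)
    "epsilon": 0.1,            # epsilon-greedy exploration rate
    "gamma": 0.9,              # discount factor
    "tabu_list_size": 100,     # size of the Tabu list
    "tabu_search_steps": 100,  # Tabu search steps before returning the best RM
}
```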
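The hardware row refers to a Python implementation of Tabu search that learns an RM from a training set of one million transitions. The authors' parallel implementation lives in the repository linked above; as a generic point of reference only, a minimal serial Tabu search skeleton with the reported list size and step budget might look like the following, where `neighbours` and `score` are problem-specific placeholders rather than the paper's candidate-RM operations.

```python
def tabu_search(initial, neighbours, score, steps=100, tabu_size=100):
    """Generic Tabu search skeleton (not the authors' implementation).

    `neighbours(x)` returns candidate successors of `x`, and `score(x)` returns
    a value to maximize; both are placeholders for problem-specific logic.
    """
    current = best = initial
    tabu = [initial]  # bounded list of recently visited candidates
    for _ in range(steps):
        candidates = [n for n in neighbours(current) if n not in tabu]
        if not candidates:
            break  # no admissible moves left
        current = max(candidates, key=score)  # best non-tabu neighbour
        if score(current) > score(best):
            best = current
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)  # evict the oldest tabu entry
    return best
```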