Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Reward Machines for Partially Observable Reinforcement Learning
Authors: Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential. |
| Researcher Affiliation | Collaboration | Rodrigo Toro Icarte (University of Toronto, Vector Institute); Ethan Waldie (University of Toronto); Toryn Q. Klassen (University of Toronto, Vector Institute); Richard Valenzano (Element AI); Margarita P. Castro (University of Toronto); Sheila A. McIlraith (University of Toronto, Vector Institute) |
| Pseudocode | Yes | The complete pseudo-code can be found in the supplementary material (Algorithm 1). |
| Open Source Code | Yes | Our code is available at https://bitbucket.org/RToroIcarte/lrm. |
| Open Datasets | No | We tested our approach on three partially observable grid domains (Figure 1). |
| Dataset Splits | No | Figure 3 shows the total cumulative rewards that each approach gets every 10,000 training steps. |
| Hardware Specification | Yes | Given a training set composed of one million transitions, a simple Python implementation of Tabu search takes less than 2.5 minutes to learn an RM across all our environments, when using 62 workers on a Threadripper 2990WX processor. |
| Software Dependencies | No | We compared against 4 baselines: DDQN [35], A3C [24], ACER [37], and PPO [29] using the Open AI baseline implementations [12]. |
| Experiment Setup | Yes | In all domains, we used u_max = 10, t_w = 200,000, an epsilon-greedy policy with ϵ = 0.1, and a discount factor γ = 0.9. The size of the Tabu list and the number of steps that the Tabu search performs before returning the best RM found are both 100. |
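To make the quoted hyperparameters concrete, here is a minimal sketch of the epsilon-greedy action selection they describe (ϵ = 0.1, γ = 0.9). This is an illustrative reconstruction, not the authors' implementation; the function name `epsilon_greedy` and the list-of-Q-values interface are assumptions.

```python
import random

# Hyperparameters quoted in the experiment setup.
# u_max and the Tabu-search settings govern reward-machine learning;
# EPSILON and GAMMA govern the RL policy.
EPSILON = 0.1  # exploration rate of the epsilon-greedy policy
GAMMA = 0.9    # discount factor (used in the Q-value update, not shown here)

def epsilon_greedy(q_values, epsilon=EPSILON, rng=random):
    """With probability epsilon pick a random action index,
    otherwise pick the index of the largest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With ϵ = 0.1, roughly one action in ten is exploratory; the rest follow the current greedy estimate.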