Reward Machines for Deep RL in Noisy and Uncertain Environments
Authors: Andrew Li, Zizhao Chen, Toryn Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila McIlraith
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that exploit task structure under uncertain interpretation of the domain-specific vocabulary. Through theory and experiments, we expose pitfalls in naive approaches to this problem while simultaneously demonstrating how task structure can be successfully leveraged under noisy interpretations of the vocabulary. |
| Researcher Affiliation | Collaboration | Andrew C. Li (University of Toronto, Vector Institute); Zizhao Chen (Cornell University); Toryn Q. Klassen (University of Toronto, Vector Institute); Pashootan Vaezipoor (Georgian.io, Vector Institute); Rodrigo Toro Icarte (Pontificia Universidad Católica de Chile, Centro Nacional de Inteligencia Artificial); Sheila A. McIlraith (University of Toronto, Vector Institute) |
| Pseudocode | Yes | Algorithm 1: on-policy RL that decouples RM state inference, performed by an abstraction model M, from decision making (a hedged sketch of this loop appears below the table). |
| Open Source Code | Yes | Code and videos are available at https://github.com/andrewli77/reward-machines-noisy-environments. |
| Open Datasets | No | The paper states, 'We collect training datasets comprising 2K episodes in each domain (equalling 103K interactions in Traffic Light, 397K interactions in Kitchen, and 3.7M interactions in Colour Matching), along with validation and test datasets of 100 episodes each.' While these datasets were created for the experiments, no concrete public access information (link, DOI, formal citation) is provided for them. |
| Dataset Splits | Yes | We collect training datasets comprising 2K episodes in each domain (equalling 103K interactions in Traffic Light, 397K interactions in Kitchen, and 3.7M interactions in Colour Matching), along with validation and test datasets of 100 episodes each. |
| Hardware Specification | Yes | Each run occupied 1 GPU, 16 CPU workers, and 12G of RAM. |
| Software Dependencies | No | The paper mentions using 'the implementation of PPO found here under an MIT license: https://github.com/lcswillems/torch-ac' but does not specify exact version numbers for PPO, PyTorch, or other key software components. |
| Experiment Setup | Yes | We report all hyperparameter settings for the Deep RL experiments in Table 2. |
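
To make the decoupling described in Algorithm 1 concrete, here is a minimal sketch of one way such a loop could look. This is not the authors' implementation: the RM interface (`rm_transition`), the `abstraction_model` that predicts a distribution over propositional assignments, the gym-style `env`, and the `policy` signature are all illustrative assumptions.

```python
import numpy as np

def update_rm_belief(belief, prop_probs, rm_transition, num_rm_states):
    """Propagate a belief over RM states one step.

    belief        : (num_rm_states,) array, current belief b(u)
    prop_probs    : dict mapping each propositional assignment sigma to
                    P(sigma | observation), predicted by the abstraction model
    rm_transition : rm_transition(u, sigma) -> next RM state index u'
    """
    new_belief = np.zeros(num_rm_states)
    for u, b_u in enumerate(belief):
        if b_u == 0.0:
            continue
        for sigma, p_sigma in prop_probs.items():
            new_belief[rm_transition(u, sigma)] += b_u * p_sigma
    return new_belief / new_belief.sum()

def run_episode(env, policy, abstraction_model, rm_transition, num_rm_states):
    obs = env.reset()
    belief = np.zeros(num_rm_states)
    belief[0] = 1.0  # assume the RM starts in state 0
    done = False
    while not done:
        # Decision making: the policy conditions on the raw observation
        # plus the belief over RM states (never on the true RM state).
        action = policy(obs, belief)
        obs, reward, done, _ = env.step(action)
        # RM state inference: the abstraction model interprets the noisy
        # observation, and the belief is pushed through the RM transitions.
        prop_probs = abstraction_model(obs)
        belief = update_rm_belief(belief, prop_probs,
                                  rm_transition, num_rm_states)
```

In an on-policy setup like the paper's PPO experiments, the (observation, belief) pairs and rewards collected by such a loop would form the rollout buffer used for the policy update.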