Reward Machines for Deep RL in Noisy and Uncertain Environments

Authors: Andrew Li, Zizhao Chen, Toryn Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila McIlraith

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that exploit task structure under uncertain interpretation of the domain-specific vocabulary. Through theory and experiments, we expose pitfalls in naive approaches to this problem while simultaneously demonstrating how task structure can be successfully leveraged under noisy interpretations of the vocabulary. (A minimal belief-update sketch of this setup appears after this table.)
Researcher Affiliation | Collaboration | Andrew C. Li (University of Toronto; Vector Institute), Zizhao Chen (Cornell University), Toryn Q. Klassen (University of Toronto; Vector Institute), Pashootan Vaezipoor (Georgian.io; Vector Institute), Rodrigo Toro Icarte (Pontificia Universidad Católica de Chile; Centro Nacional de Inteligencia Artificial), Sheila A. McIlraith (University of Toronto; Vector Institute)
Pseudocode | Yes | Algorithm 1: On-policy RL that decouples RM state inference (using an abstraction model M) from decision making. (A rollout sketch of this loop appears after this table.)
Open Source Code | Yes | Code and videos are available at https://github.com/andrewli77/reward-machines-noisy-environments.
Open Datasets | No | The paper states, 'We collect training datasets comprising 2K episodes in each domain (equalling 103K interactions in Traffic Light, 397K interactions in Kitchen, and 3.7M interactions in Colour Matching), along with validation and test datasets of 100 episodes each.' While these datasets were created for the experiments, no concrete public access information (link, DOI, or formal citation) is provided for them.
Dataset Splits | Yes | We collect training datasets comprising 2K episodes in each domain (equalling 103K interactions in Traffic Light, 397K interactions in Kitchen, and 3.7M interactions in Colour Matching), along with validation and test datasets of 100 episodes each.
Hardware Specification | Yes | Each run occupied 1 GPU, 16 CPU workers, and 12G of RAM.
Software Dependencies | No | The paper mentions using 'the implementation of PPO found here under an MIT license: https://github.com/lcswillems/torch-ac' but does not specify exact version numbers for PPO, PyTorch, or other key software components.
Experiment Setup | Yes | We report all hyperparameter settings for the Deep RL experiments in Table 2.
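The Research Type row summarizes the paper's approach: a reward machine (RM) encodes task structure over a propositional vocabulary, but in noisy environments those propositions can only be estimated from observations, so the agent maintains a belief over RM states (the POMDP view quoted above). The following is a minimal sketch of that idea, assuming a small vocabulary and independent per-proposition probabilities; `RewardMachine` and `update_belief` are illustrative names for this sketch, not the authors' code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, FrozenSet, Tuple


@dataclass
class RewardMachine:
    """A reward machine: finite states, transitions triggered by truth
    assignments over a propositional vocabulary, and per-transition rewards."""
    states: Tuple[int, ...]
    initial_state: int
    # delta[(u, assignment)] -> next RM state; `assignment` is the frozenset
    # of propositions that hold at the current step.
    delta: Dict[Tuple[int, FrozenSet[str]], int]
    # reward[(u, assignment)] -> scalar reward for taking that transition.
    reward: Dict[Tuple[int, FrozenSet[str]], float]


def update_belief(rm, belief, prop_probs):
    """Propagate a belief over RM states given per-proposition probabilities
    predicted by an abstraction model. Treating the propositions as
    independent is a simplifying assumption made for this sketch."""
    props = list(prop_probs)
    new_belief = {u: 0.0 for u in rm.states}
    # Enumerate all truth assignments (tractable for small vocabularies).
    for values in product([False, True], repeat=len(props)):
        assignment = frozenset(p for p, v in zip(props, values) if v)
        p_assign = 1.0
        for p, v in zip(props, values):
            p_assign *= prop_probs[p] if v else 1.0 - prop_probs[p]
        for u, b in belief.items():
            if b == 0.0:
                continue
            u_next = rm.delta.get((u, assignment), u)  # self-loop if undefined
            new_belief[u_next] += b * p_assign
    total = sum(new_belief.values())
    return {u: b / total for u, b in new_belief.items()}
```

For instance, a two-state "visit the goal" task over the single (hypothetical) proposition "at_goal" would split its belief according to the abstraction model's confidence:

```python
rm = RewardMachine(
    states=(0, 1),
    initial_state=0,
    delta={(0, frozenset({"at_goal"})): 1},
    reward={(0, frozenset({"at_goal"})): 1.0},
)
belief = {0: 1.0, 1: 0.0}
belief = update_belief(rm, belief, {"at_goal": 0.3})  # -> {0: 0.7, 1: 0.3}
```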
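The Pseudocode row cites Algorithm 1, which decouples RM state inference (via an abstraction model M) from decision making. A rough sketch of that rollout loop follows, reusing `update_belief` from the previous sketch; `env`, `policy`, and `abstraction_model` are placeholder objects with assumed interfaces (`reset`/`step`, `act`, `predict`), not the authors' API.

```python
import numpy as np


def run_episode(env, rm, abstraction_model, policy, max_steps=500):
    """One on-policy rollout in the spirit of Algorithm 1: the abstraction
    model infers proposition probabilities from the raw observation, the
    belief over RM states is propagated with update_belief, and the policy
    conditions its decisions on (observation, RM-state belief)."""
    obs = env.reset()
    belief = {u: float(u == rm.initial_state) for u in rm.states}
    trajectory = []
    for _ in range(max_steps):
        belief_vec = np.array([belief[u] for u in rm.states])
        action = policy.act(obs, belief_vec)              # decision making
        next_obs, reward, done, _ = env.step(action)
        prop_probs = abstraction_model.predict(next_obs)  # RM state inference
        belief = update_belief(rm, belief, prop_probs)
        trajectory.append((obs, belief_vec, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```

Collected trajectories would then be passed to an on-policy learner such as PPO (the paper builds on the torch-ac implementation linked above); the policy update and the training of the abstraction model itself are omitted from this sketch.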