Reward Machines for Deep RL in Noisy and Uncertain Environments
Authors: Andrew Li, Zizhao Chen, Toryn Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila McIlraith
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that exploit task structure under uncertain interpretation of the domain-specific vocabulary. Through theory and experiments, we expose pitfalls in naive approaches to this problem while simultaneously demonstrating how task structure can be successfully leveraged under noisy interpretations of the vocabulary. |
| Researcher Affiliation | Collaboration | Andrew C. Li (University of Toronto, Vector Institute); Zizhao Chen (Cornell University); Toryn Q. Klassen (University of Toronto, Vector Institute); Pashootan Vaezipoor (Georgian.io, Vector Institute); Rodrigo Toro Icarte (Pontificia Universidad Católica de Chile, Centro Nacional de Inteligencia Artificial); Sheila A. McIlraith (University of Toronto, Vector Institute) |
| Pseudocode | Yes | Algorithm 1: on-policy RL that decouples RM state inference, performed by an abstraction model M, from decision making (a hedged sketch of this loop appears below the table). |
| Open Source Code | Yes | Code and videos are available at https://github.com/andrewli77/reward-machines-noisy-environments. |
| Open Datasets | No | The paper states, 'We collect training datasets comprising 2K episodes in each domain (equalling 103K interactions in Traffic Light, 397K interactions in Kitchen, and 3.7M interactions in Colour Matching), along with validation and test datasets of 100 episodes each.' While these datasets were created for the experiments, no concrete public access information (link, DOI, formal citation) is provided for them. |
| Dataset Splits | Yes | We collect training datasets comprising 2K episodes in each domain (equalling 103K interactions in Traffic Light, 397K interactions in Kitchen, and 3.7M interactions in Colour Matching), along with validation and test datasets of 100 episodes each. |
| Hardware Specification | Yes | Each run occupied 1 GPU, 16 CPU workers, and 12G of RAM. |
| Software Dependencies | No | The paper mentions using 'the implementation of PPO found here under an MIT license: https://github.com/lcswillems/torch-ac' but does not specify exact version numbers for PPO, PyTorch, or other key software components. |
| Experiment Setup | Yes | We report all hyperparameter settings for the Deep RL experiments in Table 2. |
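
To make the decoupling described in Algorithm 1 concrete, here is a minimal sketch of one way such a loop could look. This is not the authors' implementation: the RM interface (`rm_transition`), the `abstraction_model` that predicts a distribution over propositional assignments, the gym-style `env`, and the `policy` signature are all illustrative assumptions.

```python
import numpy as np

def update_rm_belief(belief, prop_probs, rm_transition, num_rm_states):
    """Propagate a belief over RM states one step.

    belief        : (num_rm_states,) array, current belief b(u)
    prop_probs    : dict mapping each propositional assignment sigma to
                    P(sigma | observation), predicted by the abstraction model
    rm_transition : rm_transition(u, sigma) -> next RM state index u'
    """
    new_belief = np.zeros(num_rm_states)
    for u, b_u in enumerate(belief):
        if b_u == 0.0:
            continue
        for sigma, p_sigma in prop_probs.items():
            new_belief[rm_transition(u, sigma)] += b_u * p_sigma
    return new_belief / new_belief.sum()

def run_episode(env, policy, abstraction_model, rm_transition, num_rm_states):
    obs = env.reset()
    belief = np.zeros(num_rm_states)
    belief[0] = 1.0  # assume the RM starts in state 0
    done = False
    while not done:
        # Decision making: the policy conditions on the raw observation
        # plus the belief over RM states (never on the true RM state).
        action = policy(obs, belief)
        obs, reward, done, _ = env.step(action)
        # RM state inference: the abstraction model interprets the noisy
        # observation, and the belief is pushed through the RM transitions.
        prop_probs = abstraction_model(obs)
        belief = update_rm_belief(belief, prop_probs,
                                  rm_transition, num_rm_states)
```

In an on-policy setup like the paper's PPO experiments, the (observation, belief) pairs and rewards collected by such a loop would form the rollout buffer used for the policy update.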