Reinforcement Learning with Stochastic Reward Machines

Authors: Jan Corazza, Ivan Gavran, Daniel Neider

AAAI 2022, pp. 6429-6436

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess the performance of SRMI, we have implemented a Python 3 prototype... To assess its performance, we compare SRMI to the baseline algorithm and the JIRP algorithm for classical reward machines on two case studies: the mining example from Section 2 and an example inspired by harvesting... Our primary metric is the cumulative reward averaged over the last 100 episodes. We conducted 10 independent runs for each algorithm... All experiments were conducted on a 3 GHz machine with 1.5 TB RAM. Mining: Fig. 3a shows the comparison on the Mining environment. ... Harvest: Fig. 5a shows the comparison on the Harvest environment.
Researcher Affiliation | Academia | Jan Corazza (1, 2), Ivan Gavran (2), Daniel Neider (2); 1: University of Zagreb, 2: Max Planck Institute for Software Systems
Pseudocode | Yes | Algorithm 1: SRMI; Algorithm 2: Estimates
Open Source Code | No | To assess the performance of SRMI, we have implemented a Python 3 prototype based on code by Toro Icarte et al. (2018), which we will make publicly available (also see the supplementary material).
Open Datasets | No | We illustrate all notions on a running example called Mining. Mining, inspired by variations of Minecraft (e.g., Andreas, Klein, and Levine 2017), models the problem of finding and exploiting ore in an unknown environment. The Harvest environment represents a crop-farming cycle. These are described as internal examples or environments, not as publicly available datasets with access details.
Dataset Splits | No | The paper describes interactions with an environment through “episodes” and “traces” but does not specify exact training, validation, or test dataset splits.
Hardware Specification | Yes | All experiments were conducted on a 3 GHz machine with 1.5 TB RAM.
Software Dependencies | Yes | To assess the performance of SRMI, we have implemented a Python 3 prototype... using Z3 (de Moura and Bjørner 2008) as the constraint solver. (A minimal Z3 usage sketch follows the table.)
Experiment Setup | Yes | Our primary metric is the cumulative reward averaged over the last 100 episodes. We conducted 10 independent runs for each algorithm. For this case study, we have set the baseline algorithm to replay 20 traces per counterexample. SRMI was successful in learning the optimal policy, while the baseline algorithm got stuck in collecting the required number of samples (5). (A sketch of this metric computation follows the table.)
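
The software-dependencies row above names Z3 as the constraint solver used by the Python 3 prototype. For readers unfamiliar with it, the snippet below is a minimal sketch of Z3's Python bindings (the z3-solver package); the variables and constraints are illustrative placeholders only and are not the paper's actual stochastic-reward-machine encoding.

```python
# pip install z3-solver
from z3 import Real, Solver, And, sat

# Hypothetical constraints: bound two reward estimates so they stay
# consistent with made-up observed reward samples. This illustrates the
# solver API only, not the encoding used in the paper.
solver = Solver()
r1, r2 = Real("r1"), Real("r2")
solver.add(And(r1 >= 0.9, r1 <= 1.1))   # samples observed near 1.0
solver.add(And(r2 >= -0.1, r2 <= 0.1))  # samples observed near 0.0

if solver.check() == sat:
    model = solver.model()
    print("consistent estimates:", model[r1], model[r2])
else:
    print("constraints are unsatisfiable")
```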
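
The experiment-setup row states that the primary metric is the cumulative reward averaged over the last 100 episodes, computed over 10 independent runs. The sketch below shows one way such a metric could be computed; the array names, shapes, and random placeholder data are assumptions and do not come from the authors' code.

```python
import numpy as np

NUM_RUNS, NUM_EPISODES, WINDOW = 10, 2000, 100  # episode count is assumed

# Placeholder data: rewards[run, episode] = cumulative reward of that episode.
rng = np.random.default_rng(0)
rewards = rng.random((NUM_RUNS, NUM_EPISODES))

def rolling_mean(x, w):
    """Mean of the w most recent values at every episode index."""
    c = np.cumsum(np.insert(x, 0, 0.0))
    means = (c[w:] - c[:-w]) / w
    return np.concatenate([np.full(w - 1, np.nan), means])

# Average over the last WINDOW episodes within each run, then across runs.
per_run = np.stack([rolling_mean(run, WINDOW) for run in rewards])
metric = np.nanmean(per_run, axis=0)
print("metric at the final episode:", metric[-1])
```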