Reinforcement Learning with Stochastic Reward Machines

Authors: Jan Corazza, Ivan Gavran, Daniel Neider

AAAI 2022, pp. 6429-6436

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess the performance of SRMI, we have implemented a Python 3 prototype... To assess its performance, we compare SRMI to the baseline algorithm and the JIRP algorithm for classical reward machines on two case studies: the mining example from Section 2 and an example inspired by harvesting... Our primary metric is the cumulative reward averaged over the last 100 episodes. We conducted 10 independent runs for each algorithm... All experiments were conducted on a 3 GHz machine with 1.5 TB RAM. Mining: Fig. 3a shows the comparison on the Mining environment. ... Harvest: Fig. 5a shows the comparison on the Harvest environment.
Researcher Affiliation | Academia | Jan Corazza (1, 2), Ivan Gavran (2), Daniel Neider (2); 1: University of Zagreb, 2: Max Planck Institute for Software Systems
Pseudocode | Yes | Algorithm 1: SRMI; Algorithm 2: Estimates
Open Source Code | No | To assess the performance of SRMI, we have implemented a Python 3 prototype based on code by Toro Icarte et al. (2018), which we will make publicly available (also see the supplementary material).
Open Datasets | No | We illustrate all notions on a running example called Mining. Mining, inspired by variations of Minecraft (e.g., Andreas, Klein, and Levine 2017), models the problem of finding and exploiting ore in an unknown environment. The Harvest environment represents a crop-farming cycle. These are described as internal examples or environments, not as publicly available datasets with access details.
Dataset Splits | No | The paper describes interactions with an environment through “episodes” and “traces” but does not specify exact training, validation, or test dataset splits.
Hardware Specification | Yes | All experiments were conducted on a 3 GHz machine with 1.5 TB RAM.
Software Dependencies | Yes | To assess the performance of SRMI, we have implemented a Python 3 prototype... using Z3 (de Moura and Bjørner 2008) as the constraint solver. (A minimal Z3 usage sketch follows the table.)
Experiment Setup | Yes | Our primary metric is the cumulative reward averaged over the last 100 episodes. We conducted 10 independent runs for each algorithm. For this case study, we have set the baseline algorithm to replay 20 traces per counterexample. SRMI was successful in learning the optimal policy, while the baseline algorithm got stuck in collecting the required number of samples (5). (A sketch of this metric computation follows the table.)
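
The software-dependencies row above names Z3 as the constraint solver used by the Python 3 prototype. For readers unfamiliar with it, the snippet below is a minimal sketch of Z3's Python bindings (the z3-solver package); the variables and constraints are illustrative placeholders only and are not the paper's actual stochastic-reward-machine encoding.

```python
# pip install z3-solver
from z3 import Real, Solver, And, sat

# Hypothetical constraints: bound two reward estimates so they stay
# consistent with made-up observed reward samples. This illustrates the
# solver API only, not the encoding used in the paper.
solver = Solver()
r1, r2 = Real("r1"), Real("r2")
solver.add(And(r1 >= 0.9, r1 <= 1.1))   # samples observed near 1.0
solver.add(And(r2 >= -0.1, r2 <= 0.1))  # samples observed near 0.0

if solver.check() == sat:
    model = solver.model()
    print("consistent estimates:", model[r1], model[r2])
else:
    print("constraints are unsatisfiable")
```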
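
The experiment-setup row states that the primary metric is the cumulative reward averaged over the last 100 episodes, computed over 10 independent runs. The sketch below shows one way such a metric could be computed; the array names, shapes, and random placeholder data are assumptions and do not come from the authors' code.

```python
import numpy as np

NUM_RUNS, NUM_EPISODES, WINDOW = 10, 2000, 100  # episode count is assumed

# Placeholder data: rewards[run, episode] = cumulative reward of that episode.
rng = np.random.default_rng(0)
rewards = rng.random((NUM_RUNS, NUM_EPISODES))

def rolling_mean(x, w):
    """Mean of the w most recent values at every episode index."""
    c = np.cumsum(np.insert(x, 0, 0.0))
    means = (c[w:] - c[:-w]) / w
    return np.concatenate([np.full(w - 1, np.nan), means])

# Average over the last WINDOW episodes within each run, then across runs.
per_run = np.stack([rolling_mean(run, WINDOW) for run in rewards])
metric = np.nanmean(per_run, axis=0)
print("metric at the final episode:", metric[-1])
```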