Reinforcement Learning with Non-Markovian Rewards
Authors: Maor Gaon, Ronen Brafman
AAAI 2020, pp. 3980-3987
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe and evaluate empirically four combinations of the classical RL algorithms Q-learning and R-max with automata learning algorithms to obtain new RL algorithms for domains with NMR. We also prove that some of these variants converge to an optimal policy in the limit. We evaluated the algorithms on two environments: non-Markovian multi-armed bandit (MAB) and Robot World. |
| Researcher Affiliation | Academia | Maor Gaon, Ronen I. Brafman Ben-Gurion University of the Negev, Beer Sheva, Israel maorga@post.bgu.ac.il brafman@cs.bgu.ac.il |
| Pseudocode | Yes | Algorithm 1: RL with NMR and EDSM; Algorithm 2: RL with NMR and L* (a high-level sketch of the shared structure follows the table) |
| Open Source Code | No | The paper mentions using specific libraries for automata learning, such as "GILearning library (github.com/gabrer/gi-learning)" and "Flex Fringe library (Verwer and Hammerschmidt 2017)", but does not provide a link or statement about the open-source release of the authors' own implementation of the proposed RL algorithms. |
| Open Datasets | No | The paper describes two custom environments, "non Markovian multi-armed bandit (MAB) and robot world," which were created for the experiments. It does not provide access information (link, DOI, specific citation with authors/year) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not specify traditional training/validation/test dataset splits. For the MAB experiments, it states "Tests were run for 4 million steps, with the results evaluated every 100,000 steps (= 5,000 traces)", and for Robot World, "Each experiment consists of 25,000,000 steps. The policy was evaluated every 1,000,000 steps." This refers to evaluation points during the learning process, not a distinct validation set for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using Python and specific libraries like "GILearning library (github.com/gabrer/gi-learning)" and "Flex Fringe library (Verwer and Hammerschmidt 2017)". However, it does not provide specific version numbers for Python or these libraries, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | All solvers use ϵ-greedy exploration, with ϵ annealed from 0.9 to 0.1 at a rate of 1e-6 per step. The learning rate α was set to 0.1. The discount factor was 0.99 in one set of experiments and 0.999999 in the other. (A short sketch of this schedule follows the table.) |
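
The Pseudocode row refers to Algorithms 1 and 2 of the paper, which combine Q-learning (or R-max) with an automata learning algorithm (EDSM or L*). The sketch below is only an illustration of that general structure, assuming a Q-learning agent acting on the product of the environment observation and the state of a reward automaton re-learned periodically from collected traces; `env`, `learn_automaton`, and the `automaton` interface are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch only: combines tabular Q-learning with a reward automaton
# learned from traces (e.g. via EDSM or L*). All interfaces below are assumed.
import random
from collections import defaultdict

def rl_with_nmr(env, episodes, learn_automaton, relearn_every=500):
    """Q-learning over the product of the environment observation and the
    state of a reward automaton learned from collected traces."""
    Q = defaultdict(float)           # Q[(state, action)] -> value
    traces = []                      # finished traces fed to the automaton learner
    automaton = None                 # current hypothesis automaton (e.g. a DFA)

    for episode in range(episodes):
        # Periodically re-learn the reward automaton from the traces seen so far.
        if episode % relearn_every == 0 and traces:
            automaton = learn_automaton(traces)
        obs = env.reset()
        q_state = automaton.initial_state() if automaton else 0
        trace, done = [], False
        while not done:
            state = (obs, q_state)   # product state: observation x automaton state
            action = greedy_or_random(Q, state, env.actions, eps=0.1)
            next_obs, reward, done = env.step(action)
            next_q = automaton.step(q_state, (obs, action, reward)) if automaton else 0
            # Standard Q-learning update on the product state space.
            target = reward + 0.99 * max(Q[((next_obs, next_q), a)] for a in env.actions)
            Q[(state, action)] += 0.1 * (target - Q[(state, action)])
            trace.append((obs, action, reward))
            obs, q_state = next_obs, next_q
        traces.append(trace)
    return Q

def greedy_or_random(Q, state, actions, eps):
    """Epsilon-greedy action selection over the learned Q-values."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Tracking the automaton state alongside the observation is what restores the Markov property: the same observation can map to different product states, and hence different Q-values, depending on the reward-relevant history.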
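
The Experiment Setup row states the hyperparameters only in prose. A minimal sketch of the quoted exploration schedule, assuming a linear decay of ϵ by 1e-6 per step from 0.9 to 0.1; the constant names are illustrative, not taken from the authors' code.

```python
# Hyperparameters quoted in the Experiment Setup row (names are illustrative).
EPS_START, EPS_END, EPS_DECAY = 0.9, 0.1, 1e-6
ALPHA = 0.1   # learning rate
GAMMA = 0.99  # discount factor (the paper also reports 0.999999 for one setting)

def epsilon_at(step: int) -> float:
    """Exploration rate after `step` environment steps, clipped at its final value."""
    return max(EPS_END, EPS_START - EPS_DECAY * step)
```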