Reinforcement Learning with Non-Markovian Rewards
Authors: Maor Gaon, Ronen Brafman
AAAI 2020, pp. 3980-3987
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe and evaluate empirically four combinations of the classical RL algorithms Q-learning and R-max with automata learning algorithms to obtain new RL algorithms for domains with NMR. We also prove that some of these variants converge to an optimal policy in the limit. We evaluated the algorithms on two environments: non-Markovian multi-armed bandit (MAB) and Robot World. |
| Researcher Affiliation | Academia | Maor Gaon, Ronen I. Brafman Ben-Gurion University of the Negev, Beer Sheva, Israel maorga@post.bgu.ac.il brafman@cs.bgu.ac.il |
| Pseudocode | Yes | Algorithm 1: RL with NMR and EDSM; Algorithm 2: RL with NMR and L* (a high-level sketch of the shared structure follows the table) |
| Open Source Code | No | The paper mentions using specific libraries for automata learning, such as "GILearning library (github.com/gabrer/gi-learning)" and "Flex Fringe library (Verwer and Hammerschmidt 2017)", but does not provide a link or statement about the open-source release of the authors' own implementation of the proposed RL algorithms. |
| Open Datasets | No | The paper describes two custom environments, "non Markovian multi-armed bandit (MAB) and robot world," which were created for the experiments. It does not provide access information (link, DOI, specific citation with authors/year) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not specify traditional training/validation/test dataset splits. For the MAB experiments, it states "Tests were run for 4 million steps, with the results evaluated every 100,000 steps (= 5,000 traces)", and for Robot World, "Each experiment consists of 25,000,000 steps. The policy was evaluated every 1,000,000 steps." This refers to evaluation points during the learning process, not a distinct validation set for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using Python and specific libraries like "GILearning library (github.com/gabrer/gi-learning)" and "Flex Fringe library (Verwer and Hammerschmidt 2017)". However, it does not provide specific version numbers for Python or these libraries, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | All solvers use ϵ-greedy exploration, with ϵ annealed from 0.9 to 0.1 at a rate of 1e-6 per step. The learning rate α was set to 0.1. The discount factor was 0.99 in one set of experiments and 0.999999 in the other. (A short sketch of this schedule follows the table.) |
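
The Pseudocode row refers to Algorithms 1 and 2 of the paper, which combine Q-learning (or R-max) with an automata learning algorithm (EDSM or L*). The sketch below is only an illustration of that general structure, assuming a Q-learning agent acting on the product of the environment observation and the state of a reward automaton re-learned periodically from collected traces; `env`, `learn_automaton`, and the `automaton` interface are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch only: combines tabular Q-learning with a reward automaton
# learned from traces (e.g. via EDSM or L*). All interfaces below are assumed.
import random
from collections import defaultdict

def rl_with_nmr(env, episodes, learn_automaton, relearn_every=500):
    """Q-learning over the product of the environment observation and the
    state of a reward automaton learned from collected traces."""
    Q = defaultdict(float)           # Q[(state, action)] -> value
    traces = []                      # finished traces fed to the automaton learner
    automaton = None                 # current hypothesis automaton (e.g. a DFA)

    for episode in range(episodes):
        # Periodically re-learn the reward automaton from the traces seen so far.
        if episode % relearn_every == 0 and traces:
            automaton = learn_automaton(traces)
        obs = env.reset()
        q_state = automaton.initial_state() if automaton else 0
        trace, done = [], False
        while not done:
            state = (obs, q_state)   # product state: observation x automaton state
            action = greedy_or_random(Q, state, env.actions, eps=0.1)
            next_obs, reward, done = env.step(action)
            next_q = automaton.step(q_state, (obs, action, reward)) if automaton else 0
            # Standard Q-learning update on the product state space.
            target = reward + 0.99 * max(Q[((next_obs, next_q), a)] for a in env.actions)
            Q[(state, action)] += 0.1 * (target - Q[(state, action)])
            trace.append((obs, action, reward))
            obs, q_state = next_obs, next_q
        traces.append(trace)
    return Q

def greedy_or_random(Q, state, actions, eps):
    """Epsilon-greedy action selection over the learned Q-values."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Tracking the automaton state alongside the observation is what restores the Markov property: the same observation can map to different product states, and hence different Q-values, depending on the reward-relevant history.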
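
The Experiment Setup row states the hyperparameters only in prose. A minimal sketch of the quoted exploration schedule, assuming a linear decay of ϵ by 1e-6 per step from 0.9 to 0.1; the constant names are illustrative, not taken from the authors' code.

```python
# Hyperparameters quoted in the Experiment Setup row (names are illustrative).
EPS_START, EPS_END, EPS_DECAY = 0.9, 0.1, 1e-6
ALPHA = 0.1   # learning rate
GAMMA = 0.99  # discount factor (the paper also reports 0.999999 for one setting)

def epsilon_at(step: int) -> float:
    """Exploration rate after `step` environment steps, clipped at its final value."""
    return max(EPS_END, EPS_START - EPS_DECAY * step)
```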