Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

Authors: Tomáš Brázdil, Krishnendu Chatterjee, Petr Novotný, Jiří Vahala

AAAI 2020, pp. 9794-9801 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We implemented RAlph and evaluated it on two sets of benchmarks. The first one is a modified, perfectly observable version of Hallway (Pineau et al. 2003; Smith and Simmons 2004)... As a second benchmark, we consider a controllable random walk (RW). The results are summarized in Table 1."
Researcher Affiliation | Academia | "(1) Faculty of Informatics, Masaryk University, Brno, Czech Republic, {xbrazdil, petr.novotny, xvahala1}@fi.muni.cz; (2) Institute of Science and Technology Austria, Klosterneuburg, Austria, Krishnendu.Chatterjee@ist.ac.at"
Pseudocode | Yes | "Algorithm 1: Training and evaluation of RAlph." and "Algorithm 2: The episode sampling of RAlph."
Open Source Code | Yes | "Implementation can be found at https://github.com/snurkabill/MasterThesis/releases/tag/AAAI_release"
Open Datasets | Yes | "We implemented RAlph and evaluated it on two sets of benchmarks. The first one is a modified, perfectly observable version of Hallway (Pineau et al. 2003; Smith and Simmons 2004)"
Dataset Splits | No | The paper describes training and evaluation phases using episodes, but it does not specify explicit train/validation/test dataset splits with percentages or counts for the datasets used.
Hardware Specification | Yes | "The test configuration was: CPU: Intel Xeon E5-2620 v2@2.1GHz (24 cores); 8GB heap size; Debian 8."
Software Dependencies | No | The paper names only the operating system and gives no library or framework versions: "The test configuration was: CPU: Intel Xeon E5-2620 v2@2.1GHz (24 cores); 8GB heap size; Debian 8."
Experiment Setup | Yes | "Input: MDP M (with a horizon H), risk bound Δ, no. of training episodes m, batch size n" (from Algorithm 1); "C is a suitable exploration constant, a parameter fixed in advance of the computation."; and "Both algorithms were evaluated over 1000 episodes, with a timeout of 1 hour per evaluation."
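
The Experiment Setup row above lists the paper's reported evaluation parameters (horizon H, risk bound Δ, m training episodes, batch size n, exploration constant C, 1000 evaluation episodes, 1-hour timeout). The sketch below shows one way such a configuration and evaluation loop could be wired up; it is a minimal illustration, and all identifiers (ExperimentConfig, evaluate, the sample_episode callable) are hypothetical and not taken from the authors' released RAlph code.

```python
# Minimal, hypothetical sketch of the reported experiment setup.
# Identifiers are illustrative; they do not come from the RAlph implementation.
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ExperimentConfig:
    horizon: int                     # H: MDP horizon
    risk_bound: float                # Δ: bound on the probability of failure
    train_episodes: int              # m: number of training episodes
    batch_size: int                  # n: batch size (from Algorithm 1)
    exploration_c: float             # C: exploration constant fixed in advance
    eval_episodes: int = 1000        # evaluation length reported in the paper
    eval_timeout_s: float = 3600.0   # 1-hour timeout per evaluation


def evaluate(policy: object,
             sample_episode: Callable[[object, int], Tuple[float, bool]],
             cfg: ExperimentConfig) -> Tuple[float, float]:
    """Run a fixed-budget evaluation: up to cfg.eval_episodes episodes,
    stopping early if the wall-clock timeout is exceeded."""
    start = time.monotonic()
    returns, failures = [], 0
    for _ in range(cfg.eval_episodes):
        if time.monotonic() - start > cfg.eval_timeout_s:
            break  # respect the 1-hour evaluation timeout
        ret, failed = sample_episode(policy, cfg.horizon)
        returns.append(ret)
        failures += int(failed)
    n = max(len(returns), 1)
    mean_return = sum(returns) / n
    empirical_risk = failures / n  # compare against cfg.risk_bound (Δ)
    return mean_return, empirical_risk
```

Comparing the resulting empirical_risk against cfg.risk_bound is one way to check whether a Δ-style constraint is met over the 1000 evaluation episodes; the paper's actual evaluation procedure is specified by Algorithms 1 and 2 and may differ in detail.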