Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes
Authors: Tomáš Brázdil, Krishnendu Chatterjee, Petr Novotný, Jiří Vahala
AAAI 2020, pp. 9794-9801
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implemented RAlph and evaluated it on two sets of benchmarks. The first one is a modified, perfectly observable version of Hallway (Pineau et al. 2003; Smith and Simmons 2004)... As a second benchmark, we consider a controllable random walk (RW). The results are summarized in Table 1. |
| Researcher Affiliation | Academia | ¹Faculty of Informatics, Masaryk University, Brno, Czech Republic ({xbrazdil, petr.novotny, xvahala1}@fi.muni.cz); ²Institute of Science and Technology Austria, Klosterneuburg, Austria (Krishnendu.Chatterjee@ist.ac.at) |
| Pseudocode | Yes | Algorithm 1: Training and evaluation of RAlph. and Algorithm 2: The episode sampling of RAlph. |
| Open Source Code | Yes | Implementation can be found at https://github.com/snurkabill/MasterThesis/releases/tag/AAAI_release |
| Open Datasets | Yes | We implemented RAlph and evaluated it on two sets of benchmarks. The first one is a modified, perfectly observable version of Hallway (Pineau et al. 2003; Smith and Simmons 2004) |
| Dataset Splits | No | The paper describes training and evaluation phases in terms of episodes, but it does not specify explicit train/validation/test splits with percentages or counts; the benchmarks are simulated environments rather than fixed datasets. |
| Hardware Specification | Yes | The test configuration was: CPU: Intel Xeon E5-2620 v2@2.1GHz (24 cores); 8GB heap size; Debian 8. |
| Software Dependencies | No | The paper reports only the test configuration (Intel Xeon CPU, 8GB heap size, Debian 8); it does not list software dependencies such as libraries, frameworks, or version numbers. |
| Experiment Setup | Yes | Input: MDP M (with a horizon H), risk bound Δ, no. of training episodes m, batch size n (from Algorithm 1) and C is a suitable exploration constant, a parameter fixed in advance of the computation. and Both algorithms were evaluated over 1000 episodes, with a timeout of 1 hour per evaluation. A hedged sketch of this setup appears below the table. |
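
The Pseudocode and Experiment Setup rows together pin down the reported protocol: Algorithm 1 trains RAlph over m episodes with batch size n, Algorithm 2 samples the individual episodes, and the learned policy is then evaluated over 1000 episodes with a 1-hour timeout. The sketch below is a minimal, hypothetical rendering of that protocol, not the authors' implementation: the environment (`RandomWalkEnv`, loosely inspired by the paper's controllable random walk), the agent (`RiskConstrainedAgent`), and all parameter values (H, Δ, m, n, C) are illustrative stand-ins.

```python
"""Hedged sketch of the reported protocol; NOT the authors' code.

All names and numeric values below are hypothetical stand-ins chosen
to make the reported protocol concrete and runnable.
"""
import random
import time


class RandomWalkEnv:
    """Toy controllable random walk, loosely inspired by the RW benchmark."""

    def __init__(self, horizon: int):
        self.horizon = horizon

    def reset(self) -> int:
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action: int):
        # The agent's action biases an otherwise random +/-1 step.
        self.pos += action + random.choice([-1, 1])
        self.t += 1
        reward = 1.0 if self.pos > 0 else 0.0
        failed = self.pos <= -5                 # the "risk" event to bound
        done = failed or self.t >= self.horizon
        return self.pos, reward, failed, done


class RiskConstrainedAgent:
    """Placeholder policy; RAlph instead plans with an MCTS-style search
    using exploration constant C while respecting the risk bound Delta."""

    def __init__(self, delta: float, c_explore: float):
        self.delta, self.c_explore = delta, c_explore

    def act(self, state: int) -> int:
        return 1 if state <= 0 else random.choice([0, 1])

    def train_on_batch(self, episodes) -> None:
        pass  # learning update deliberately omitted in this sketch


def run_episode(agent, env):
    """Episode sampling in the spirit of Algorithm 2."""
    state, total, failed = env.reset(), 0.0, False
    while True:
        state, reward, fail, done = env.step(agent.act(state))
        total += reward
        failed = failed or fail
        if done:
            return total, failed


def train(agent, env, m: int, batch: int) -> None:
    """Training in the spirit of Algorithm 1: m episodes, an update every n."""
    buffer = []
    for _ in range(m):
        buffer.append(run_episode(agent, env))
        if len(buffer) >= batch:
            agent.train_on_batch(buffer)
            buffer.clear()


def evaluate(agent, env, episodes: int = 1000, timeout_s: int = 3600):
    """The reported evaluation: 1000 episodes, 1-hour wall-clock timeout."""
    start, returns, fails = time.monotonic(), [], 0
    for _ in range(episodes):
        if time.monotonic() - start > timeout_s:
            break  # per-evaluation timeout, as in the paper's setup
        total, failed = run_episode(agent, env)
        returns.append(total)
        fails += failed
    n = max(len(returns), 1)
    return sum(returns) / n, fails / n  # mean return, empirical risk


if __name__ == "__main__":
    env = RandomWalkEnv(horizon=20)                         # H: assumed
    agent = RiskConstrainedAgent(delta=0.1, c_explore=1.4)  # Delta, C: assumed
    train(agent, env, m=1000, batch=32)                     # m, n: assumed
    print(evaluate(agent, env, episodes=100, timeout_s=60))
```

In the paper, action selection comes from risk-constrained planning driven by the exploration constant C and the risk bound Δ; here `act` is a trivial placeholder so that the training and evaluation protocol itself stays in focus.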