Beyond Average Return in Markov Decision Processes
Authors: Alexandre Marthe, Aurélien Garivier, Claire Vernade
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment: empirical validation of the bounds on a simple MDP. We consider a simple Chain MDP environment of length H = 70 equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution for every step. |
| Researcher Affiliation | Academia | Alexandre Marthe, UMPA, ENS de Lyon, Lyon, France, alexandre.marthe@ens-lyon.fr; Aurélien Garivier, UMPA UMR 5669 and LIP UMR 5668, Univ. Lyon, ENS de Lyon, 46 allée d'Italie, F-69364 Lyon cedex 07, France, aurelien.garivier@ens-lyon.fr; Claire Vernade, University of Tuebingen, Tuebingen, Germany, claire.vernade@uni-tuebingen.de |
| Pseudocode | Yes | Algorithm 1 Policy Evaluation (Dynamic Programming) for Distributional RL; Algorithm 2 Pseudo-Algorithm: Exact Planning with Distributional RL; Algorithm 3 Q-Learning for Linear and Exponential Utilities |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | No | We consider a simple Chain MDP environment of length H = 70 equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution for every step. We consider a Bernoulli reward distribution B(0.5) for each state so that the number of atoms for the return only grows linearly with the number of steps, which makes it easy to compute the exact distribution. |
| Dataset Splits | No | The paper describes a simple synthetic MDP environment but does not specify any train/validation/test dataset splits or cross-validation setup for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., library names, framework versions, or solver versions) needed to replicate the experiment. |
| Experiment Setup | Yes | We consider a simple Chain MDP environment of length H = 70 equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution for every step. We consider a Bernoulli reward distribution B(0.5) for each state so that the number of atoms for the return only grows linearly with the number of steps, which makes it easy to compute the exact distribution... with a quantile projection with resolution N = 1000... We also empirically validate Theorem 1 by computing the CVaR(α) for α ∈ {0.1, 0.25}, corresponding respectively to distorted means with Lipschitz constants L ∈ {10, 4}. (Hedged sketches re-creating this setup appear after the table.) |
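
The authors do not release code (see the Open Source Code row), but the setup quoted above is simple enough to re-create. The following is a minimal sketch, not the paper's implementation: it computes the exact return distribution of the described Chain MDP (length H = 70, a single action, a Bernoulli(0.5) reward at every step) by propagating the distribution one step at a time, in the spirit of Algorithm 1 (policy evaluation by dynamic programming for distributional RL).

```python
import numpy as np

H = 70          # horizon / chain length reported in the paper
p = 0.5         # parameter of the Bernoulli reward B(0.5)

# Return distribution after 0 steps: a point mass at 0.
dist = np.array([1.0])
for _ in range(H):
    # Convolve with the per-step reward distribution {0: 1 - p, 1: p}.
    # After h steps the return lives on {0, ..., h}, so the number of atoms
    # only grows linearly with the number of steps.
    new_dist = np.zeros(len(dist) + 1)
    new_dist[:-1] += (1 - p) * dist   # this step's reward is 0
    new_dist[1:] += p * dist          # this step's reward is 1
    dist = new_dist

atoms = np.arange(H + 1)              # support of the exact return distribution
print("mean return:", atoms @ dist)   # H * p = 35.0
```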
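
A second hedged sketch, assuming the standard midpoint-quantile representation of quantile-based distributional RL: it projects the exact return distribution onto N = 1000 quantiles and evaluates CVaR(α) for α ∈ {0.1, 0.25}, the distorted means with Lipschitz constants 10 and 4 mentioned in the setup. The function names (`quantile_projection`, `cvar_from_quantiles`) are illustrative and not taken from the paper.

```python
import numpy as np
from math import comb

H, p = 70, 0.5
atoms = np.arange(H + 1)
# Exact return distribution of the Chain MDP: a Binomial(H, p) over the atoms.
dist = np.array([comb(H, k) * p**k * (1 - p)**(H - k) for k in range(H + 1)])

def quantile_projection(atoms, probs, n_quantiles=1000):
    """Represent a discrete distribution by its quantiles at the midpoint
    levels (2i + 1) / (2N), i = 0..N-1."""
    cdf = np.cumsum(probs)
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
    idx = np.searchsorted(cdf, taus)              # generalized inverse CDF
    return atoms[np.minimum(idx, len(atoms) - 1)]

def cvar_from_quantiles(quantiles, alpha):
    """CVaR(alpha): mean of the worst alpha-fraction of quantiles."""
    k = max(1, int(np.floor(alpha * len(quantiles))))
    return np.sort(quantiles)[:k].mean()

quantiles = quantile_projection(atoms, dist, n_quantiles=1000)
for alpha in (0.1, 0.25):
    print(f"CVaR({alpha}) = {cvar_from_quantiles(quantiles, alpha):.2f}")
```

CVaR(α) is taken here in its lower-tail form (the expected return over the worst α-fraction of outcomes), which as a distorted mean has Lipschitz constant 1/α, matching the constants 10 and 4 quoted for α = 0.1 and α = 0.25.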