Beyond Average Return in Markov Decision Processes
Authors: Alexandre Marthe, Aurélien Garivier, Claire Vernade
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment: empirical validation of the bounds on a simple MDP. We consider a simple Chain MDP environment of length H = 70 equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution for every step. |
| Researcher Affiliation | Academia | Alexandre Marthe, UMPA, ENS de Lyon, Lyon, France, alexandre.marthe@ens-lyon.fr; Aurélien Garivier, UMPA UMR 5669 and LIP UMR 5668, Univ. Lyon, ENS de Lyon, 46 allée d'Italie, F-69364 Lyon cedex 07, France, aurelien.garivier@ens-lyon.fr; Claire Vernade, University of Tuebingen, Tuebingen, Germany, claire.vernade@uni-tuebingen.de |
| Pseudocode | Yes | Algorithm 1 Policy Evaluation (Dynamic Programming) for Distributional RL; Algorithm 2 Pseudo-Algorithm: Exact Planning with Distributional RL; Algorithm 3 Q-Learning for Linear and Exponential Utilities |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | No | We consider a simple Chain MDP environment of length H = 70 equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution for every step. We consider a Bernoulli reward distribution B(0.5) for each state so that the number of atoms for the return only grows linearly with the number of steps, which makes it easy to compute the exact distribution. |
| Dataset Splits | No | The paper describes a simple synthetic MDP environment but does not specify any train/validation/test dataset splits or cross-validation setup for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., library names, framework versions, or solver versions) needed to replicate the experiment. |
| Experiment Setup | Yes | We consider a simple Chain MDP environment of length H = 70 equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution for every step. We consider a Bernoulli reward distribution B(0.5) for each state so that the number of atoms for the return only grows linearly with the number of steps, which makes it easy to compute the exact distribution... with a quantile projection with resolution N = 1000... We also empirically validate Theorem 1 by computing the CVaR(α) for α ∈ {0.1, 0.25}, corresponding respectively to distorted means with Lipschitz constants L ∈ {10, 4}. (Hedged sketches re-creating this setup appear after the table.) |
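
The authors do not release code (see the Open Source Code row), but the setup quoted above is simple enough to re-create. The following is a minimal sketch, not the paper's implementation: it computes the exact return distribution of the described Chain MDP (length H = 70, a single action, a Bernoulli(0.5) reward at every step) by propagating the distribution one step at a time, in the spirit of Algorithm 1 (policy evaluation by dynamic programming for distributional RL).

```python
import numpy as np

H = 70          # horizon / chain length reported in the paper
p = 0.5         # parameter of the Bernoulli reward B(0.5)

# Return distribution after 0 steps: a point mass at 0.
dist = np.array([1.0])
for _ in range(H):
    # Convolve with the per-step reward distribution {0: 1 - p, 1: p}.
    # After h steps the return lives on {0, ..., h}, so the number of atoms
    # only grows linearly with the number of steps.
    new_dist = np.zeros(len(dist) + 1)
    new_dist[:-1] += (1 - p) * dist   # this step's reward is 0
    new_dist[1:] += p * dist          # this step's reward is 1
    dist = new_dist

atoms = np.arange(H + 1)              # support of the exact return distribution
print("mean return:", atoms @ dist)   # H * p = 35.0
```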
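
A second hedged sketch, assuming the standard midpoint-quantile representation of quantile-based distributional RL: it projects the exact return distribution onto N = 1000 quantiles and evaluates CVaR(α) for α ∈ {0.1, 0.25}, the distorted means with Lipschitz constants 10 and 4 mentioned in the setup. The function names (`quantile_projection`, `cvar_from_quantiles`) are illustrative and not taken from the paper.

```python
import numpy as np
from math import comb

H, p = 70, 0.5
atoms = np.arange(H + 1)
# Exact return distribution of the Chain MDP: a Binomial(H, p) over the atoms.
dist = np.array([comb(H, k) * p**k * (1 - p)**(H - k) for k in range(H + 1)])

def quantile_projection(atoms, probs, n_quantiles=1000):
    """Represent a discrete distribution by its quantiles at the midpoint
    levels (2i + 1) / (2N), i = 0..N-1."""
    cdf = np.cumsum(probs)
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
    idx = np.searchsorted(cdf, taus)              # generalized inverse CDF
    return atoms[np.minimum(idx, len(atoms) - 1)]

def cvar_from_quantiles(quantiles, alpha):
    """CVaR(alpha): mean of the worst alpha-fraction of quantiles."""
    k = max(1, int(np.floor(alpha * len(quantiles))))
    return np.sort(quantiles)[:k].mean()

quantiles = quantile_projection(atoms, dist, n_quantiles=1000)
for alpha in (0.1, 0.25):
    print(f"CVaR({alpha}) = {cvar_from_quantiles(quantiles, alpha):.2f}")
```

CVaR(α) is taken here in its lower-tail form (the expected return over the worst α-fraction of outcomes), which as a distorted mean has Lipschitz constant 1/α, matching the constants 10 and 4 quoted for α = 0.1 and α = 0.25.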