Regret Bounds for Markov Decision Processes with Recursive Optimized Certainty Equivalents

Authors: Wenhao Xu, Xuefeng Gao, Xuedong He

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section, we conduct numerical experiments to illustrate the performance of the OCE-VI algorithm on randomly generated MDPs. ... Figures 2 and 3 illustrate the performance comparisons of the OCE-VI algorithm with other algorithms, where we plot the average regret of each algorithm as a function of the number of episodes K. We compute the expected regret of each algorithm by averaging over 30 independent runs |
| Researcher Affiliation | Academia | Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China. Correspondence to: Xuefeng Gao <xfgao@se.cuhk.edu.hk>. |
| Pseudocode | Yes | Algorithm 1 The OCE-VI Algorithm |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of its source code. |
| Open Datasets | No | The paper states: 'We adopt the methods in Dann (2019, Section 4.7) to randomly generate MDPs with state space S = {1, ..., S}, action space A = {1, ..., A} and episode length H.' However, it does not provide a direct link, DOI, or formal citation for a publicly available or open dataset. Dann (2019) is a PhD thesis, which describes methods for generating data, not a direct public dataset. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test splits, nor does it mention cross-validation. It describes how the MDPs are generated and used for evaluation. |
| Hardware Specification | No | The paper does not specify any hardware used for running the experiments. It only mentions 'randomly generated MDPs' and 'numerical experiments'. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies. It compares its algorithm with 'RSVI2 and RSQ2 algorithms in Fei et al. (2021a)' and 'UCBVI (with Chernoff-Hoeffding bonus) and UCBVI-BF (with Bernstein bonus) algorithms in Azar et al. (2017)' but does not list versions of the software used for implementation. |
| Experiment Setup | Yes | The paper states: 'The first one is (H, S, A) = (3, 6, 3), and we use the risk-aversion parameter β = 0.6 for the entropic risk and c = 1/6 for the mean-variance models. We set K = 10^6 and δ = 1/(2KH) for all algorithms. The second one is (H, S, A) = (6, 20, 3), and we use β = 0.6 for the entropic risk and c = 1/12 for the mean-variance models. Because the size of the MDP becomes larger and learning can be more difficult in the second setting, we consider K = 10^7'. |
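
Since no source code is released, the two quoted settings can at least be written down in a form that is easy to reuse. The following is a minimal sketch, not the authors' implementation: the names `Setting`, `SETTINGS`, the labels `"small"` and `"large"`, and the helper `average_regret` are hypothetical, and only the numeric values and the formula δ = 1/(2KH) come from the Experiment Setup entry. The helper illustrates averaging regret curves over independent runs (the paper reports 30), assuming each run is stored as a list of regret values of equal length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    """One experimental configuration, values as quoted in Experiment Setup."""
    H: int       # episode length
    S: int       # number of states
    A: int       # number of actions
    beta: float  # risk-aversion parameter for the entropic risk
    c: float     # parameter for the mean-variance model
    K: int       # number of episodes

    @property
    def delta(self) -> float:
        # Confidence parameter delta = 1 / (2 K H), as quoted for all algorithms.
        return 1.0 / (2 * self.K * self.H)


# Labels "small" and "large" are hypothetical; the paper calls them
# the first and the second setting.
SETTINGS = {
    "small": Setting(H=3, S=6, A=3, beta=0.6, c=1 / 6, K=10**6),
    "large": Setting(H=6, S=20, A=3, beta=0.6, c=1 / 12, K=10**7),
}


def average_regret(runs):
    """Pointwise mean of regret curves from independent runs.

    `runs` is a list of equal-length lists, one regret curve per run
    (an assumed storage layout, not something the paper specifies).
    """
    if not runs:
        raise ValueError("need at least one run")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must have the same length")
    return [sum(values) / len(runs) for values in zip(*runs)]


if __name__ == "__main__":
    for name, setting in SETTINGS.items():
        print(name, setting.delta)
```

Keeping δ as a derived property rather than a stored field means it cannot drift out of sync with K and H if a setting is changed.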