Regret Bounds for Markov Decision Processes with Recursive Optimized Certainty Equivalents
Authors: Wenhao Xu, Xuefeng Gao, Xuedong He
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct numerical experiments to illustrate the performance of the OCE-VI algorithm on randomly generated MDPs. ... Figures 2 and 3 illustrate the performance comparisons of the OCE-VI algorithm with other algorithms, where we plot the average regret of each algorithm as a function of the number of episodes K. We compute the expected regret of each algorithm by averaging over 30 independent runs. |
| Researcher Affiliation | Academia | 1Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China. Correspondence to: Xuefeng Gao <xfgao@se.cuhk.edu.hk>. |
| Pseudocode | Yes | Algorithm 1 The OCE-VI Algorithm |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of its source code. |
| Open Datasets | No | The paper states: 'We adopt the methods in Dann (2019, Section 4.7) to randomly generate MDPs with state space S = {1, ..., S}, action space A = {1, ..., A} and episode length H.' However, it does not provide a direct link, DOI, or formal citation for a publicly available or open dataset. Dann (2019) is a PhD thesis that describes methods for generating data; it is not itself a public dataset. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test splits, nor does it mention cross-validation. It describes how the MDPs are generated and used for evaluation. |
| Hardware Specification | No | The paper does not specify any hardware used for running the experiments. It only mentions 'randomly generated MDPs' and 'numerical experiments'. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies. It compares its algorithm with 'RSVI2 and RSQ2 algorithms in Fei et al. (2021a)' and 'UCBVI (with Chernoff-Hoeffding bonus) and UCBVI-BF (with Bernstein bonus) algorithms in Azar et al. (2017)' but does not list versions of the software used for implementation. |
| Experiment Setup | Yes | The paper states: 'The first one is (H, S, A) = (3, 6, 3), and we use the risk-aversion parameter β = 0.6 for the entropic risk and c = 1/6 for the mean-variance models. We set K = 10^6 and δ = 1/(2KH) for all algorithms. The second one is (H, S, A) = (6, 20, 3), and we use β = 0.6 for the entropic risk and c = 1/12 for the mean-variance models. Because the size of the MDP becomes larger and learning can be more difficult in the second setting, we consider K = 10^7'. |
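
The two experiment settings quoted in the last row can be written down as a small configuration, which makes the derived confidence parameter δ = 1/(2KH) explicit. This is only an illustrative sketch: the class name `ExperimentSetting`, the list `SETTINGS`, and the constant `N_RUNS` are hypothetical names chosen here, not taken from the paper, which does not release code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetting:
    """Hypothetical container for one (H, S, A) setting quoted from the paper."""
    H: int        # episode length
    S: int        # number of states
    A: int        # number of actions
    K: int        # number of episodes
    beta: float   # risk-aversion parameter for the entropic risk
    c: float      # parameter for the mean-variance model

    @property
    def delta(self) -> float:
        # Confidence parameter as stated in the paper: delta = 1 / (2 K H)
        return 1.0 / (2 * self.K * self.H)


# Values copied from the "Experiment Setup" row; the names are assumptions.
SETTINGS = [
    ExperimentSetting(H=3, S=6, A=3, K=10**6, beta=0.6, c=1 / 6),
    ExperimentSetting(H=6, S=20, A=3, K=10**7, beta=0.6, c=1 / 12),
]

# Number of independent runs averaged to estimate expected regret.
N_RUNS = 30

if __name__ == "__main__":
    for s in SETTINGS:
        print(f"(H, S, A) = ({s.H}, {s.S}, {s.A}), K = {s.K}, delta = {s.delta:.3e}")
```

Keeping δ as a derived property rather than a stored field means it always stays consistent with K and H if either value is changed.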