Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path
Authors: Yihan Du, Siwei Wang, Longbo Huang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, we present experiments to validate our theoretical results and show the performance superiority of our algorithm (see Appendix A). ... As shown in Figure 2, our algorithm ICVaR-RM achieves a significantly lower regret than the other algorithms EULER (Zanette & Brunskill, 2019) and RSVI2 (Fei et al., 2021a), which demonstrates that ICVaR-RM can effectively control the risk under the Iterated CVaR criterion and shows performance superiority over the baselines. Moreover, the influences of parameters α, δ, H, S, A and K on the regret of algorithm ICVaR-RM match our theoretical bounds. |
| Researcher Affiliation | Collaboration | Yihan Du, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China (duyh18@mails.tsinghua.edu.cn); Siwei Wang, Microsoft Research, Beijing, China (siweiwang@microsoft.com); Longbo Huang, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China (longbohuang@tsinghua.edu.cn) |
| Pseudocode | Yes | Algorithm 1: ICVaR-RM (Page 4), Algorithm 2: MaxWP (Page 5), Algorithm 3: ICVaR-BPI (Appendix E.1, Page 19). |
| Open Source Code | No | The paper does not contain any statement about releasing source code or providing a link to a code repository for the methodology described. |
| Open Datasets | No | In our experiments, we consider an H-layered MDP with S = 3(H − 1) + 1 states and A actions. ... The agent starts from s_0 in layer 1, and for each step h ∈ [H], she takes an action from {a_1, . . . , a_A}, and then transitions to one of three states in the next layer. The paper describes a synthetic MDP environment for its experiments and does not mention using or providing access to any publicly available dataset. (A minimal code sketch of this layered environment is given after the table.) |
| Dataset Splits | No | The paper describes its experimental setup within a simulated MDP environment but does not mention providing specific training/validation/test dataset splits. Performance is evaluated through cumulative regret over episodes, which is a common metric in episodic reinforcement learning, but not a dataset split. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing resources used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python version, specific libraries, or frameworks). |
| Experiment Setup | Yes | We set α ∈ {0.05, 0.1, 0.15}, δ ∈ {0.5, 0.005, 0.00005}, H ∈ {2, 5, 10}, S ∈ {7, 13, 25}, A ∈ {3, 5, 12} and K ∈ [0, 10000] (the change of K can be seen from the X-axis in Figure 2). We take α = 0.05, δ = 0.005, H = 5, S = 13, A = 5 and K = 10000 as the basic setting, and change parameters α, δ, H, S, A and K to see how they affect the empirical performance of algorithm ICVaR-RM. (A sketch of this one-at-a-time parameter sweep appears after the table.) |
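
To make the synthetic environment quoted in the "Open Datasets" row concrete, the following is a minimal Python sketch of an H-layered MDP with S = 3(H − 1) + 1 states and A actions, where the agent starts from s_0 in layer 1 and transitions to one of three states in the next layer at each step. The paper does not report its transition probabilities or rewards, so the Dirichlet/uniform placeholders, function names, and random policy below are illustrative assumptions, not the authors' actual construction.

```python
import numpy as np


def build_layered_mdp(H: int, A: int, seed: int = 0):
    """Lay out the H-layered toy MDP: S = 3(H - 1) + 1 states, A actions."""
    rng = np.random.default_rng(seed)
    S = 3 * (H - 1) + 1

    # layer_states[h] lists the state indices in layer h (0-indexed layers):
    # one start state, then three states per subsequent layer.
    layer_states = [[0]] + [[1 + 3 * (h - 1) + i for i in range(3)] for h in range(1, H)]

    # P[h, s, a] is a distribution over the 3 states of layer h + 1.
    # Placeholder Dirichlet draws; the paper's actual kernel is not given here.
    P = rng.dirichlet(np.ones(3), size=(H - 1, S, A))

    # r[h, s, a] in [0, 1]; placeholder values as well.
    r = rng.uniform(size=(H, S, A))
    return layer_states, P, r


def run_episode(layer_states, P, r, policy, rng):
    """Roll out one episode: start at s_0 in layer 1, move one layer per step."""
    H = len(layer_states)
    s, ret = 0, 0.0
    for h in range(H):
        a = policy(h, s)
        ret += r[h, s, a]
        if h < H - 1:
            nxt = rng.choice(3, p=P[h, s, a])
            s = layer_states[h + 1][nxt]
    return ret


# Example with H = 5, A = 5 (the report's basic setting) and a uniform random policy.
rng = np.random.default_rng(1)
layers, P, r = build_layered_mdp(H=5, A=5)
print(run_episode(layers, P, r, lambda h, s: rng.integers(5), rng))
```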
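
Similarly, a small sketch of the one-at-a-time sweep described in the "Experiment Setup" row: the grid values and the basic setting are copied from the report, while the sweep structure and the hypothetical `run_icvar_rm` training call are assumptions for illustration.

```python
# One-at-a-time parameter sweep around the reported basic setting.
BASIC = {"alpha": 0.05, "delta": 0.005, "H": 5, "S": 13, "A": 5, "K": 10000}
GRIDS = {
    "alpha": [0.05, 0.1, 0.15],
    "delta": [0.5, 0.005, 0.00005],
    "H": [2, 5, 10],
    "S": [7, 13, 25],
    "A": [3, 5, 12],
}


def sweep_configs(basic=BASIC, grids=GRIDS):
    """Yield (varied_parameter, config) pairs, changing one parameter at a time."""
    for name, values in grids.items():
        for value in values:
            yield name, dict(basic, **{name: value})


for varied, cfg in sweep_configs():
    # A hypothetical run_icvar_rm(cfg) would train ICVaR-RM for cfg["K"] episodes
    # and record cumulative regret; here we just enumerate the configurations.
    print(f"vary {varied}: {cfg}")
```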