Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path

Authors: Yihan Du, Siwei Wang, Longbo Huang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Moreover, we present experiments to validate our theoretical results and show the performance superiority of our algorithm (see Appendix A). ... As shown in Figure 2, our algorithm ICVaR-RM achieves a significantly lower regret than the other algorithms EULER (Zanette & Brunskill, 2019) and RSVI2 (Fei et al., 2021a), which demonstrates that ICVaR-RM can effectively control the risk under the Iterated CVaR criterion and shows performance superiority over the baselines. Moreover, the influences of parameters α, δ, H, S, A and K on the regret of algorithm ICVaR-RM match our theoretical bounds.
Researcher Affiliation | Collaboration | Yihan Du (Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; duyh18@mails.tsinghua.edu.cn); Siwei Wang (Microsoft Research, Beijing, China; siweiwang@microsoft.com); Longbo Huang (Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; longbohuang@tsinghua.edu.cn)
Pseudocode | Yes | Algorithm 1: ICVaR-RM (Page 4), Algorithm 2: MaxWP (Page 5), Algorithm 3: ICVaR-BPI (Appendix E.1, Page 19). A hedged sketch of the Iterated CVaR backup these algorithms build on appears after this table.
Open Source Code | No | The paper does not contain any statement about releasing source code or providing a link to a code repository for the methodology described.
Open Datasets | No | In our experiments, we consider an H-layered MDP with S = 3(H - 1) + 1 states and A actions. ... The agent starts from s0 in layer 1, and for each step h ∈ [H], she takes an action from {a_1, . . . , a_A}, and then transitions to one of three states in the next layer. The paper describes a synthetic MDP environment for its experiments and does not mention using or providing access to any publicly available dataset. (A construction sketch of this environment follows the table.)
Dataset Splits | No | The paper describes its experimental setup within a simulated MDP environment and does not provide training/validation/test dataset splits. Performance is evaluated through cumulative regret over episodes, the standard metric in episodic reinforcement learning, rather than through dataset splits.
Hardware Specification | No | The paper does not provide any hardware details such as GPU/CPU models, memory, or cloud computing resources used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python version, specific libraries, or frameworks).
Experiment Setup | Yes | We set α ∈ {0.05, 0.1, 0.15}, δ ∈ {0.5, 0.005, 0.00005}, H ∈ {2, 5, 10}, S ∈ {7, 13, 25}, A ∈ {3, 5, 12} and K ∈ [0, 10000] (the change of K can be seen from the X-axis in Figure 2). We take α = 0.05, δ = 0.005, H = 5, S = 13, A = 5 and K = 10000 as the basic setting, and change parameters α, δ, H, S, A and K to see how they affect the empirical performance of algorithm ICVaR-RM. (This sweep is restated as a configuration grid after the table.)
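
The algorithms listed in the Pseudocode row all optimize the Iterated CVaR criterion, which replaces the expectation in the usual Bellman backup with a CVaR operator at level α over the next-state distribution. The sketch below is a minimal planning version under assumptions: known step-indexed transitions P and rewards R, the immediate reward placed outside the CVaR operator, and none of the optimistic bonus terms that ICVaR-RM adds on top of empirical estimates. Function names (cvar, iterated_cvar_values) are illustrative, not from the paper.

```python
import numpy as np

def cvar(values, probs, alpha):
    """CVaR_alpha of a discrete distribution: the expectation over the
    worst alpha-fraction of outcomes (the lower tail)."""
    order = np.argsort(values)                      # worst outcomes first
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    cum = np.cumsum(p)
    # probability mass of each outcome that falls inside the alpha-tail
    tail = np.clip(np.minimum(cum, alpha) - (cum - p), 0.0, None)
    return float(tail @ v) / alpha

def iterated_cvar_values(P, R, alpha):
    """Backward induction under the Iterated CVaR criterion:
        V_h(s) = max_a [ R[s, a] + CVaR_alpha_{s' ~ P[h, s, a]}(V_{h+1}(s')) ].
    P: (H, S, A, S) step-indexed transition tensor, R: (S, A) rewards."""
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))                        # V[H] = 0 at the horizon
    for h in range(H - 1, -1, -1):
        for s in range(S):
            V[h, s] = max(R[s, a] + cvar(V[h + 1], P[h, s, a], alpha)
                          for a in range(A))
    return V
```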
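
The synthetic environment quoted in the Open Datasets row can be reconstructed from its description: one start state s0 in layer 1 and three states in each of the remaining H - 1 layers, giving S = 3(H - 1) + 1, with every action leading to one of the three states of the next layer. The transition probabilities and rewards below are placeholder assumptions (the paper's appendix fixes the actual values); build_layered_mdp is an illustrative name.

```python
import numpy as np

def build_layered_mdp(H, A, seed=0):
    """H-layered MDP with S = 3*(H - 1) + 1 states: start state s0 in
    layer 1, then three states per layer; every (state, action) pair
    moves to one of the three states of the next layer."""
    rng = np.random.default_rng(seed)
    S = 3 * (H - 1) + 1

    def layer_states(layer):
        # layer 1 holds only s0; layer k >= 2 holds three consecutive states
        if layer == 1:
            return [0]
        return list(range(3 * (layer - 2) + 1, 3 * (layer - 1) + 1))

    P = np.zeros((H, S, A, S))
    for h in range(1, H):                       # steps 1 .. H-1 cross layers
        nxt = layer_states(h + 1)
        for s in layer_states(h):
            for a in range(A):
                # ASSUMPTION: random transition probabilities over the
                # three next-layer states; the paper specifies the actual ones.
                P[h - 1, s, a, nxt] = rng.dirichlet(np.ones(len(nxt)))
    R = rng.uniform(0.0, 1.0, size=(S, A))      # ASSUMPTION: random rewards
    return P, R

# The basic setting H = 5, A = 5 gives S = 13, matching the report; the
# output plugs into iterated_cvar_values from the previous sketch.
P, R = build_layered_mdp(H=5, A=5)
```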
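
The Experiment Setup row translates directly into a one-parameter-at-a-time sweep around the basic setting; K is the episode count shown on the x-axis of Figure 2 rather than a swept value. A minimal restatement as code (names are illustrative):

```python
# Sweep grid from the reported setup: each configuration varies exactly
# one parameter around the basic setting.
basic = dict(alpha=0.05, delta=0.005, H=5, S=13, A=5, K=10_000)
sweeps = {
    "alpha": [0.05, 0.1, 0.15],
    "delta": [0.5, 0.005, 0.00005],
    "H":     [2, 5, 10],
    "S":     [7, 13, 25],     # in the layered MDP, S = 3*(H - 1) + 1,
    "A":     [3, 5, 12],      # so sweeping S implicitly varies the depth
}
configs = [{**basic, name: value}
           for name, values in sweeps.items() for value in values]
```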