Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR
Authors: Kaiwen Wang, Nathan Kallus, Wen Sun
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance τ. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is Ω(√(τ⁻¹AK)), where A is the number of actions and K is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of Ω(√(τ⁻¹SAK)) (with normalized cumulative rewards), where S is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of Õ(√(τ⁻¹SAK)) under a continuity assumption and in general attains a near-optimal regret of Õ(τ⁻¹√(SAK)), which is minimax-optimal for constant τ. This improves on the best available bounds. |
| Researcher Affiliation | Academia | ¹Computer Science, Cornell University; ²Operations Research and Information Engineering, Cornell Tech. |
| Pseudocode | Yes | Algorithm 1 BERNSTEIN-UCB |
| Open Source Code | No | The paper includes a personal website link for correspondence: "Correspondence to: Kaiwen Wang <https://kaiwenw.github.io>.", but this is not an explicit statement of open-sourcing the code for the described methodology or a direct repository link. |
| Open Datasets | No | The paper is theoretical and focuses on statistical models like multi-arm bandits and tabular Markov Decision Processes (MDPs) rather than specific datasets. It defines reward distributions and return processes without referring to pre-existing, publicly available datasets. |
| Dataset Splits | No | The paper is theoretical and does not discuss training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not mention any specific software components or their version numbers, such as programming languages, libraries, or solvers used for implementation or experimentation. |
| Experiment Setup | No | The paper describes algorithms and theoretical guarantees but does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size) or system-level training settings. |
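To make the pseudocode row concrete: the paper's BERNSTEIN-UCB plays optimistically with respect to an empirical CVaR estimate plus an exploration bonus. The sketch below is a minimal, hypothetical illustration of that optimistic-CVaR template, not the paper's algorithm — it substitutes a simple Hoeffding-style bonus `√(log K / (τ·n))` for the paper's Bernstein bonus, and the function names and reward model are assumptions for the example.

```python
import numpy as np

def empirical_cvar(samples, tau):
    """Empirical CVaR_tau: mean of the worst tau-fraction of samples."""
    k = max(1, int(np.ceil(tau * len(samples))))
    return np.sort(np.asarray(samples))[:k].mean()

def cvar_ucb(arm_dists, tau, K, seed=0):
    """Optimistic CVaR bandit sketch: pull each arm once, then repeatedly
    pull the arm maximizing empirical CVaR plus an exploration bonus.
    Uses a simple Hoeffding-style bonus for illustration only; the paper's
    algorithm uses a sharper Bernstein-type bonus."""
    rng = np.random.default_rng(seed)
    A = len(arm_dists)
    samples = [[] for _ in range(A)]
    pulls = []
    for t in range(K):
        if t < A:  # initialization round: one pull per arm
            a = t
        else:
            ucb = [empirical_cvar(s, tau)
                   + np.sqrt(np.log(K) / (tau * len(s)))  # assumed bonus form
                   for s in samples]
            a = int(np.argmax(ucb))
        samples[a].append(arm_dists[a](rng))
        pulls.append(a)
    return pulls
```

For instance, with a deterministic arm paying 0.5 (CVaR 0.5 at any τ) against a Uniform[0, 1] arm (CVaR 0.25 at τ = 0.5), the optimistic rule concentrates its pulls on the deterministic arm as the bonuses shrink.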