Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

Authors: Kaiwen Wang, Nathan Kallus, Wen Sun

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance τ. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is Ω(√(τ⁻¹AK)), where A is the number of actions and K is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of Ω(√(τ⁻¹SAK)) (with normalized cumulative rewards), where S is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of Õ(√(τ⁻¹SAK)) under a continuity assumption, and in general attains a near-optimal regret of Õ(τ⁻¹√(SAK)), which is minimax-optimal for constant τ. This improves on the best available bounds.
Researcher Affiliation | Academia | Computer Science, Cornell University; Operations Research and Information Engineering, Cornell Tech.
Pseudocode | Yes | Algorithm 1: BERNSTEIN-UCB
Open Source Code | No | The paper includes a personal website link for correspondence ("Correspondence to: Kaiwen Wang <https://kaiwenw.github.io>"), but it contains no explicit statement that the code for the described methodology is open-sourced and no direct repository link.
Open Datasets | No | The paper is theoretical, focusing on statistical models such as multi-arm bandits and tabular Markov Decision Processes (MDPs) rather than specific datasets. It defines reward distributions and return processes without referring to pre-existing, publicly available datasets.
Dataset Splits | No | The paper is theoretical and does not discuss training, validation, or test dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not mention specific software components or version numbers, such as programming languages, libraries, or solvers used for implementation or experimentation.
Experiment Setup | No | The paper describes algorithms and theoretical guarantees but does not provide experimental setup details such as hyperparameters (e.g., learning rate, batch size) or system-level training settings.
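To make the bandit setting assessed above concrete, the sketch below shows an empirical (lower) CVaR estimator, which averages the worst τ-fraction of observed rewards, together with a UCB-style loop that selects the arm maximizing empirical CVaR plus an exploration bonus. This is a minimal illustrative sketch, not the paper's BERNSTEIN-UCB: the Bernstein bonus is replaced by a simple Hoeffding-style term scaled by 1/τ, and all function names are our own.

```python
import math
import random


def empirical_cvar(samples, tau):
    """Mean of the worst tau-fraction of samples (lower CVaR at level tau)."""
    xs = sorted(samples)
    k = max(1, math.ceil(tau * len(xs)))
    return sum(xs[:k]) / k


def cvar_ucb(arms, tau, episodes, rng):
    """Optimistic CVaR bandit loop (illustrative only).

    `arms` is a list of reward samplers, each taking a random.Random instance.
    The bonus is a crude Hoeffding-style term scaled by 1/tau, standing in
    for the paper's sharper Bernstein bonus. Returns the pull count per arm.
    """
    history = [[] for _ in arms]
    pulls = [0] * len(arms)
    for _ in range(episodes):
        scores = []
        for obs in history:
            if not obs:  # pull each arm once before comparing scores
                scores.append(float("inf"))
            else:
                bonus = math.sqrt(math.log(episodes) / (tau * len(obs)))
                scores.append(empirical_cvar(obs, tau) + bonus)
        a = max(range(len(arms)), key=lambda i: scores[i])
        history[a].append(arms[a](rng))
        pulls[a] += 1
    return pulls
```

With a small τ, a low-variance arm (e.g. uniform on [0.4, 0.6]) has a higher CVaR than a high-variance arm with the same mean (uniform on [0, 1]), so the loop concentrates its pulls on the former even though neither arm is better in expectation.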