Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

Authors: Kaiwen Wang, Nathan Kallus, Wen Sun

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance τ. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is Ω(√(τ⁻¹AK)), where A is the number of actions and K is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of Ω(√(τ⁻¹SAK)) (with normalized cumulative rewards), where S is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of Õ(√(τ⁻¹SAK)) under a continuity assumption, and in general attains a near-optimal regret of Õ(τ⁻¹√(SAK)), which is minimax-optimal for constant τ. This improves on the best available bounds.
Researcher Affiliation | Academia | Computer Science, Cornell University; Operations Research and Information Engineering, Cornell Tech.
Pseudocode | Yes | Algorithm 1: BERNSTEIN-UCB
Open Source Code | No | The paper includes a personal website link for correspondence ("Correspondence to: Kaiwen Wang <https://kaiwenw.github.io>"), but it contains no explicit statement that the code for the described methodology is open-sourced and no direct repository link.
Open Datasets | No | The paper is theoretical, focusing on statistical models such as multi-arm bandits and tabular Markov Decision Processes (MDPs) rather than specific datasets. It defines reward distributions and return processes without referring to pre-existing, publicly available datasets.
Dataset Splits | No | The paper is theoretical and does not discuss training, validation, or test dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not mention specific software components or version numbers, such as programming languages, libraries, or solvers used for implementation or experimentation.
Experiment Setup | No | The paper describes algorithms and theoretical guarantees but does not provide experimental setup details such as hyperparameters (e.g., learning rate, batch size) or system-level training settings.
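To make the bandit setting assessed above concrete, the sketch below shows an empirical (lower) CVaR estimator, which averages the worst τ-fraction of observed rewards, together with a UCB-style loop that selects the arm maximizing empirical CVaR plus an exploration bonus. This is a minimal illustrative sketch, not the paper's BERNSTEIN-UCB: the Bernstein bonus is replaced by a simple Hoeffding-style term scaled by 1/τ, and all function names are our own.

```python
import math
import random


def empirical_cvar(samples, tau):
    """Mean of the worst tau-fraction of samples (lower CVaR at level tau)."""
    xs = sorted(samples)
    k = max(1, math.ceil(tau * len(xs)))
    return sum(xs[:k]) / k


def cvar_ucb(arms, tau, episodes, rng):
    """Optimistic CVaR bandit loop (illustrative only).

    `arms` is a list of reward samplers, each taking a random.Random instance.
    The bonus is a crude Hoeffding-style term scaled by 1/tau, standing in
    for the paper's sharper Bernstein bonus. Returns the pull count per arm.
    """
    history = [[] for _ in arms]
    pulls = [0] * len(arms)
    for _ in range(episodes):
        scores = []
        for obs in history:
            if not obs:  # pull each arm once before comparing scores
                scores.append(float("inf"))
            else:
                bonus = math.sqrt(math.log(episodes) / (tau * len(obs)))
                scores.append(empirical_cvar(obs, tau) + bonus)
        a = max(range(len(arms)), key=lambda i: scores[i])
        history[a].append(arms[a](rng))
        pulls[a] += 1
    return pulls
```

With a small τ, a low-variance arm (e.g. uniform on [0.4, 0.6]) has a higher CVaR than a high-variance arm with the same mean (uniform on [0, 1]), so the loop concentrates its pulls on the former even though neither arm is better in expectation.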