Regret Bounds for Risk-Sensitive Reinforcement Learning
Authors: Osbert Bastani, Yecheng Jason Ma, Estelle Shen, Wanqiao Xu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove the first regret bounds for reinforcement learning under a general class of risk-sensitive objectives including the popular CVaR objective. Our theory is based on a novel characterization of the CVaR objective as well as a novel optimistic MDP construction. Figure 1: Results on the frozen lake environment. Left: Regret of our algorithm vs. UCBVI (with expected return) and a greedy exploration strategy. Right: Regret of our algorithm across different α values. We show mean and standard deviation across five random seeds. |
| Researcher Affiliation | Academia | Osbert Bastani, University of Pennsylvania (obastani@seas.upenn.edu); Yecheng Jason Ma, University of Pennsylvania (jasonyma@seas.upenn.edu); Estelle Shen, University of Pennsylvania (pixna@sas.upenn.edu); Wanqiao Xu, Stanford University (wanqiaox@stanford.edu) |
| Pseudocode | Yes | Algorithm 1 Upper Confidence Bound Algorithm |
| Open Source Code | No | No statement or link regarding the availability of source code for the described methodology is provided in the paper. |
| Open Datasets | No | We consider a classic frozen lake problem with a finite horizon... The paper does not provide concrete access information (link, DOI, citation with authors/year) for a publicly available dataset specifically used for the frozen lake environment setup. |
| Dataset Splits | No | The paper operates in an episodic reinforcement learning setting and discusses the number of episodes (K) but does not specify dataset splits (e.g., train/validation/test percentages or counts) as typically found in supervised learning. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, cloud instances) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | We consider a classic frozen lake problem with a finite horizon. The agent moves to a block next to its current state at each timestep t and has a slipping probability of 0.1 in its moving direction if the next state is an ice block... We use a map with four paths of the same lengths that have different rewards at the end and different levels of risk of falling into holes. We consider α ∈ {0.40, 0.33, 0.25, 0.01}. (A minimal sketch of the CVaR objective at these α levels follows the table.) |
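
The Research Type and Experiment Setup rows both reference the CVaR objective at a level α. As a point of reference only, here is a minimal sketch of an empirical CVaR estimate over episode returns; the Gaussian stand-in returns and the function name are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of returns: a simple discrete
    approximation of CVaR_alpha that ignores fractional tail weights."""
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending, so worst returns first
    k = max(1, int(np.ceil(alpha * len(returns))))       # size of the lower tail
    return returns[:k].mean()

# Illustrative use at the alpha levels quoted in the Experiment Setup row.
rng = np.random.default_rng(0)
episode_returns = rng.normal(loc=1.0, scale=0.5, size=1000)  # stand-in data, not paper results
for alpha in (0.40, 0.33, 0.25, 0.01):
    print(f"alpha={alpha:.2f}  CVaR={empirical_cvar(episode_returns, alpha):.3f}")
```

Smaller α averages over only the worst outcomes and is therefore more risk-averse, while α = 1 recovers the ordinary expected return.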
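
The Pseudocode row cites Algorithm 1, an upper confidence bound algorithm, and the abstract quoted above mentions a novel optimistic MDP construction. That construction is not reproduced here; the sketch below shows only a generic UCBVI-style optimistic backup with a Hoeffding-type bonus, so the bonus form, array shapes, and toy MDP are assumptions for illustration rather than the paper's algorithm.

```python
import numpy as np

def hoeffding_bonus(counts, horizon, num_episodes, delta=0.05):
    """Generic exploration bonus ~ H * sqrt(log(S*A*H*K / delta) / n(s, a))."""
    n = np.maximum(counts, 1)
    log_term = np.log(counts.size * horizon * num_episodes / delta)
    return horizon * np.sqrt(log_term / n)

def optimistic_backup(p_hat, r_hat, bonus, horizon):
    """Finite-horizon optimistic value iteration on estimated dynamics.

    p_hat[s, a, s'] are empirical transitions, r_hat[s, a] empirical rewards,
    bonus[s, a] an exploration bonus; Q-values are clipped at the horizon."""
    num_states, num_actions, _ = p_hat.shape
    q = np.zeros((horizon + 1, num_states, num_actions))
    v = np.zeros((horizon + 1, num_states))
    for h in reversed(range(horizon)):
        q[h] = np.minimum(r_hat + bonus + p_hat @ v[h + 1], horizon)  # optimistic Bellman backup
        v[h] = q[h].max(axis=1)
    return q, v

# Tiny illustrative usage on a made-up 2-state, 2-action MDP with horizon 3.
S, A, H, K = 2, 2, 3, 100
p_hat = np.full((S, A, S), 1.0 / S)           # uniform estimated transitions
r_hat = np.array([[0.0, 0.5], [1.0, 0.2]])    # made-up estimated rewards
counts = np.array([[10, 3], [7, 1]])          # made-up visit counts
q, v = optimistic_backup(p_hat, r_hat, hoeffding_bonus(counts, H, K), H)
print(v[0])  # optimistic state values at the first timestep
```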