Regret Bounds for Risk-Sensitive Reinforcement Learning

Authors: Osbert Bastani, Yecheng Jason Ma, Estelle Shen, Wanqiao Xu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove the first regret bounds for reinforcement learning under a general class of risk-sensitive objectives including the popular CVaR objective. Our theory is based on a novel characterization of the CVaR objective as well as a novel optimistic MDP construction. Figure 1: Results on the frozen lake environment. Left: Regret of our algorithm vs. UCBVI (with expected return) and a greedy exploration strategy. Right: Regret of our algorithm across different α values. We show mean and standard deviation across five random seeds. (An illustrative CVaR estimator sketch follows the table.)
Researcher Affiliation | Academia | Osbert Bastani, University of Pennsylvania, obastani@seas.upenn.edu; Yecheng Jason Ma, University of Pennsylvania, jasonyma@seas.upenn.edu; Estelle Shen, University of Pennsylvania, pixna@sas.upenn.edu; Wanqiao Xu, Stanford University, wanqiaox@stanford.edu
Pseudocode | Yes | Algorithm 1: Upper Confidence Bound Algorithm (a generic UCB value-iteration sketch follows the table)
Open Source Code | No | No statement or link regarding the availability of source code for the described methodology is provided in the paper.
Open Datasets | No | We consider a classic frozen lake problem with a finite horizon... The paper does not provide concrete access information (link, DOI, or citation with authors/year) for a publicly available dataset specifically used for the frozen lake environment setup.
Dataset Splits | No | The paper operates in an episodic reinforcement learning setting and discusses the number of episodes (K), but does not specify dataset splits (e.g., train/validation/test percentages or counts) as typically found in supervised learning.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, cloud instances) used for running experiments are mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper.
Experiment Setup | Yes | We consider a classic frozen lake problem with a finite horizon. The agent moves to a block next to its current state at each timestep t and has a slipping probability of 0.1 in its moving direction if the next state is an ice block... We use a map with four paths of the same lengths that have different rewards at the end and different levels of risk of falling into holes. We consider α ∈ {0.40, 0.33, 0.25, 0.01}. (A toy frozen-lake environment sketch follows the table.)
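
Since the paper centers on the CVaR objective, the following is a minimal, illustrative sketch of an empirical CVaR estimator over sampled episode returns (the mean of the worst α-fraction of returns). It is a hypothetical helper for intuition only and is not the paper's characterization or its optimistic MDP construction.

```python
import numpy as np

def empirical_cvar(returns, alpha):
    """Estimate CVaR_alpha as the mean of the worst alpha-fraction of returns.

    Illustrative only: the paper analyzes CVaR through a novel
    characterization and an optimistic MDP construction, not through
    Monte Carlo estimation of sampled returns.
    """
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return sorted_returns[:k].mean()

# Example at the alpha levels used in the experiments
rng = np.random.default_rng(0)
returns = rng.uniform(0.0, 1.0, size=1000)
for alpha in (0.40, 0.33, 0.25, 0.01):
    print(f"CVaR_{alpha}: {empirical_cvar(returns, alpha):.3f}")
```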
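
The paper's Algorithm 1 is an upper confidence bound algorithm; its risk-sensitive construction is not reproduced here. For orientation, below is a generic, assumed UCBVI-style backward induction with count-based bonuses. The function name, data layout, and bonus form are all assumptions, included only to show the general shape of optimistic value iteration.

```python
import numpy as np

def optimistic_value_iteration(counts, reward_sum, trans_counts, horizon, c=1.0):
    """Generic UCB-style backward induction for a finite-horizon tabular MDP.

    counts[s, a]       : visit count of state-action pair (s, a)
    reward_sum[s, a]   : sum of observed rewards at (s, a)
    trans_counts[s, a] : vector of observed next-state counts at (s, a)

    Sketch only; the paper's CVaR-specific optimism is not shown.
    """
    n_states, n_actions = counts.shape
    V = np.zeros((horizon + 1, n_states))
    Q = np.zeros((horizon, n_states, n_actions))
    for h in range(horizon - 1, -1, -1):
        for s in range(n_states):
            for a in range(n_actions):
                n = max(1, int(counts[s, a]))
                r_hat = reward_sum[s, a] / n
                p_hat = trans_counts[s, a] / n
                bonus = c * horizon * np.sqrt(1.0 / n)  # assumed bonus form
                Q[h, s, a] = min(horizon, r_hat + p_hat @ V[h + 1] + bonus)
            V[h, s] = Q[h, s].max()
    return Q  # act greedily with respect to Q[h] at step h
```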
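
Finally, since no code is released, a toy frozen-lake-style environment with the stated 0.1 slip probability can be sketched as below. The grid layout, slip dynamics, and reward values are placeholders; the paper's actual map has four equal-length paths with different terminal rewards and different risks of falling into holes, which is not reproduced here.

```python
import numpy as np

class ToyFrozenLake:
    """Minimal frozen-lake-style gridworld; a placeholder, not the paper's map.

    'S' start, 'F' frozen ice (slippery), 'H' hole (terminal, reward 0),
    'G' goal (terminal, reward 1).
    """
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, grid=("SFFF", "FHFH", "FFFH", "HFFG"), slip=0.1, seed=0):
        self.grid = [list(row) for row in grid]
        self.slip = slip
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.MOVES[action]
        # Assumed slip model: with probability `slip`, move in a random
        # direction instead of the chosen one (the paper's dynamics may differ).
        if self.rng.random() < self.slip:
            dr, dc = self.MOVES[int(self.rng.integers(4))]
        r = min(max(self.pos[0] + dr, 0), len(self.grid) - 1)
        c = min(max(self.pos[1] + dc, 0), len(self.grid[0]) - 1)
        self.pos = (r, c)
        cell = self.grid[r][c]
        done = cell in ("H", "G")
        reward = 1.0 if cell == "G" else 0.0
        return self.pos, reward, done
```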