Risk-Sensitive Reward-Free Reinforcement Learning with CVaR

Authors: Xinyi Ni, Guanlin Liu, Lifeng Lai

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness and practicality of our CVaR reward-free approach are further validated through numerical experiments. In this section, we provide numerical examples to evaluate the proposed CVaR-RF RL framework. In these examples, we use an experimental setup similar to that in (Kaufmann et al., 2021). Our environment is configured as a grid-world consisting of 21 × 21 states, where each state offers four possible actions (up, down, left, right), and actions leading to the boundary result in remaining in the current state. The agent moves to the correct state with probability 0.95; however, there is an equal probability of 0.05/3 of moving in any one of the other three directions. Initially, the exploration algorithm CVaR-RF-UCRL runs without reward information, collecting n = 30,000 transitions. The empirical transition probability P̂ is then estimated. We use the β(n, δ) threshold from Theorem 4.7 with δ = 0.1 and set a time horizon H of 20. Using the obtained dataset and P̂, the planning algorithm derives near-optimal policies, employing CVaR-VI-DISC as the solver.
Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, University of California, Davis, Davis, USA. Correspondence to: Xinyi Ni <xni@ucdavis.edu>.
Pseudocode | Yes | Algorithm 1 CVaR-RF-UCRL; Algorithm 2 CVaR-RF-Planning; Algorithm 3 CVaR-VI; Algorithm 4 CVaR-VI-DISC
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to repositories.
Open Datasets | No | The paper describes a synthetic grid-world environment and does not mention using a publicly available or open dataset. It states: "Our environment is configured as a grid-world consisting of 21 × 21 states, where each state offers four possible actions (up, down, left, right), and actions leading to the boundary result in remaining in the current state."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, as it uses a simulated grid-world environment rather than a traditional dataset. It describes collecting transitions in the environment for exploration and then estimating empirical transition probabilities.
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies. It describes the algorithms used (e.g., CVaR-VI-DISC as the solver) but not the underlying software stack with versions.
Experiment Setup | Yes | In these examples, we use an experimental setup similar to that in (Kaufmann et al., 2021). Our environment is configured as a grid-world consisting of 21 × 21 states, where each state offers four possible actions (up, down, left, right), and actions leading to the boundary result in remaining in the current state. The agent moves to the correct state with probability 0.95; however, there is an equal probability of 0.05/3 of moving in any one of the other three directions. Initially, the exploration algorithm CVaR-RF-UCRL runs without reward information, collecting n = 30,000 transitions. The empirical transition probability P̂ is then estimated. We use the β(n, δ) threshold from Theorem 4.7 with δ = 0.1 and set a time horizon H of 20. Using the obtained dataset and P̂, the planning algorithm derives near-optimal policies, employing CVaR-VI-DISC as the solver. Reward Setup 1: The first setup is similar to that of (Kaufmann et al., 2021), where the agent starts at position (10, 10). The reward is set at 0 for most states, except at (16, 16) where it is 1.0. Here we choose ϵ = 0.1. We then execute the output policy of CVaR-VI-DISC in the same grid-world for K = 10,000 trajectories and plot the number of state visits under this policy. For comparison, we also generate the optimal policy using the true transition probabilities.
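
The grid-world dynamics quoted in the Research Type and Experiment Setup rows above (21 × 21 states, four actions, a 0.95 chance of the intended move, a 0.05/3 slip to each of the other three directions, and boundary moves that leave the agent in place) can be sketched in a few lines. The code below is an illustrative reconstruction under those stated probabilities, not the authors' implementation; all names are our own.

```python
# Illustrative sketch of the grid-world dynamics described in the setup;
# names and structure are assumptions, not the authors' code.
import numpy as np

GRID = 21
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
SLIP = 0.05 / 3  # probability of sliding to each unintended direction


def step(state, action, rng):
    """Sample the next state for (state, action) under the slip model."""
    probs = [SLIP] * 4
    probs[action] = 0.95
    dr, dc = ACTIONS[rng.choice(4, p=probs)]
    r, c = state
    nr, nc = r + dr, c + dc
    # Actions leading outside the grid keep the agent in its current state.
    if 0 <= nr < GRID and 0 <= nc < GRID:
        return (nr, nc)
    return (r, c)
```

With `rng = np.random.default_rng(0)`, repeated calls such as `step((10, 10), 3, rng)` simulate the agent's motion under the slip model starting from the center of the grid.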
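
The four algorithms listed under Pseudocode all target the CVaR criterion rather than the expected return. As a point of reference only, a common convention defines CVaR at level α of a return distribution as the mean of its worst α-fraction of outcomes; the helper below computes this empirically and is a generic sketch, not code or notation taken from the paper.

```python
# Generic empirical CVaR helper (lower-tail convention for returns);
# illustrative only, not taken from the paper.
import numpy as np


def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of the sampled returns."""
    ordered = np.sort(np.asarray(returns, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(ordered))))
    return ordered[:k].mean()


# Example: CVaR at level 0.1 of 10,000 simulated episode returns.
rng = np.random.default_rng(0)
print(empirical_cvar(rng.normal(loc=1.0, scale=0.5, size=10_000), alpha=0.1))
```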
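
The evaluation in Reward Setup 1 rolls out the planner's output policy for K = 10,000 trajectories of horizon H = 20 and records state-visit counts. A minimal sketch follows; `policy_fn` and `step_fn` are hypothetical stand-ins for the CVaR-VI-DISC output policy and the grid-world dynamics (e.g., the `step` function above), not the authors' code.

```python
# Sketch of the state-visit-count evaluation described in Reward Setup 1;
# policy_fn and step_fn are hypothetical stand-ins.
from collections import Counter

import numpy as np


def count_state_visits(policy_fn, step_fn, start, K=10_000, H=20, seed=0):
    """Roll out the policy for K trajectories of horizon H and count visits."""
    rng = np.random.default_rng(seed)
    visits = Counter()
    for _ in range(K):
        state = start
        for h in range(H):
            visits[state] += 1
            action = policy_fn(state, h)  # time-dependent policy
            state = step_fn(state, action, rng)
    return visits


# Usage with the grid-world sketch above and a trivial always-"up" policy:
# visits = count_state_visits(lambda s, h: 0, step, start=(10, 10))
```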