Risk-Sensitive Reward-Free Reinforcement Learning with CVaR
Authors: Xinyi Ni, Guanlin Liu, Lifeng Lai
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness and practicality of our CVaR reward-free approach are further validated through numerical experiments. In this section, we provide numerical examples to evaluate the proposed CVaR-RF RL framework. In these examples, we use an experimental setup similar to that in (Kaufmann et al., 2021). Our environment is configured as a grid-world consisting of 21 × 21 states, where each state offers four possible actions (up, down, left, right), and actions leading to the boundary result in remaining in the current state. The agent moves to the intended state with probability 0.95, and with equal probability 0.05/3 moves in any one of the other three directions. Initially, the exploration algorithm CVaR-RF-UCRL runs without reward information, collecting n = 30,000 transitions. The empirical transition probability P̂ is then estimated. We use the β(n, δ) threshold from Theorem 4.7 with δ = 0.1 and set a time horizon H of 20. Using the obtained dataset and P̂, the planning algorithm derives near-optimal policies, employing CVaR-VI-DISC as the solver. (A sketch of these environment dynamics appears after the table.) |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, University of California, Davis, Davis, USA. Correspondence to: Xinyi Ni <xni@ucdavis.edu>. |
| Pseudocode | Yes | Algorithm 1 CVaR-RF-UCRL; Algorithm 2 CVaR-RF-Planning; Algorithm 3 CVaR-VI; Algorithm 4 CVaR-VI-DISC |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to repositories. |
| Open Datasets | No | The paper describes a synthetic grid-world environment and does not mention using a publicly available or open dataset. It states: "Our environment is configured as a grid-world consisting of 21 × 21 states, where each state offers four possible actions (up, down, left, right), and actions leading to the boundary result in remaining in the current state." |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, as it uses a simulated grid-world environment rather than a traditional dataset split. It describes collecting transitions in the environment for exploration and then estimating empirical transition probabilities. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies. It describes the algorithms used (e.g., CVaR-VI-DISC as the solver) but not the underlying software stack with versions. |
| Experiment Setup | Yes | In these examples, we use an experimental setup similar to that in (Kaufmann et al., 2021). Our environment is configured as a grid-world consisting of 21 × 21 states, where each state offers four possible actions (up, down, left, right), and actions leading to the boundary result in remaining in the current state. The agent moves to the intended state with probability 0.95, and with equal probability 0.05/3 moves in any one of the other three directions. Initially, the exploration algorithm CVaR-RF-UCRL runs without reward information, collecting n = 30,000 transitions. The empirical transition probability P̂ is then estimated. We use the β(n, δ) threshold from Theorem 4.7 with δ = 0.1 and set a time horizon H of 20. Using the obtained dataset and P̂, the planning algorithm derives near-optimal policies, employing CVaR-VI-DISC as the solver. Reward Setup 1: The first setup is similar to (Kaufmann et al., 2021), where the agent starts at position (10, 10). The reward is 0 for most states, except at (16, 16) where it is 1.0. Here we choose ϵ = 0.1. We then execute the output policy of CVaR-VI-DISC in the same grid-world for K = 10,000 trajectories and plot the number of state visits under the policy. For comparison, we also generate the optimal policy using the true transition probabilities. (A sketch of this rollout procedure appears after the table.) |
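
The following is a minimal sketch, in Python/NumPy, of the grid-world dynamics and empirical-model estimation quoted above. The function names (`step`, `collect_transitions`) and the uniform-random behaviour policy are illustrative assumptions; the paper's CVaR-RF-UCRL exploration rule and its β(n, δ) threshold are not reproduced here.

```python
import numpy as np

# Grid-world from the paper: 21 x 21 states, 4 actions
# (up, down, left, right). The intended move succeeds with
# probability 0.95; each of the other three directions occurs
# with probability 0.05 / 3. Moves into the boundary keep the
# agent in its current state.
SIZE = 21
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
P_SUCCESS, P_SLIP = 0.95, 0.05 / 3

def step(state, action, rng):
    """Sample the next state for one transition."""
    r, c = state
    # Pick the direction actually taken (intended or a slip).
    probs = [P_SLIP] * 4
    probs[action] = P_SUCCESS
    d = rng.choice(4, p=probs)
    dr, dc = ACTIONS[d]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):
        return state  # boundary: stay in place
    return (nr, nc)

def collect_transitions(n=30_000, seed=0):
    """Collect n transitions and build the empirical model P_hat.

    A uniform-random behaviour policy is used here purely as a
    placeholder for the CVaR-RF-UCRL exploration algorithm.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros((SIZE, SIZE, 4, SIZE, SIZE))
    state = (10, 10)  # start position used in Reward Setup 1
    for _ in range(n):
        a = int(rng.integers(4))
        nxt = step(state, a, rng)
        counts[state[0], state[1], a, nxt[0], nxt[1]] += 1
        state = nxt
    visits = counts.sum(axis=(3, 4), keepdims=True)
    p_hat = np.divide(counts, visits, out=np.zeros_like(counts),
                      where=visits > 0)
    return p_hat
```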
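A similarly hedged sketch of the Reward Setup 1 evaluation, continuing the block above (it reuses `SIZE`, `step`, and NumPy): a policy is executed for K = 10,000 trajectories of horizon H = 20 from start state (10, 10), with reward 1.0 only at (16, 16), and state visits are counted. The `rollout_visit_counts` helper and the deterministic state-to-action policy array are assumptions for illustration; the CVaR-VI-DISC planner that would produce the policy is not shown.

```python
def rollout_visit_counts(policy, K=10_000, H=20, seed=1):
    """Execute a deterministic, state-based policy for K trajectories
    of horizon H and count state visits, as in Reward Setup 1.

    `policy` is assumed to be a SIZE x SIZE integer array mapping each
    state to an action index; obtaining it from CVaR-VI-DISC (or from
    planning on the true model) is not shown here.
    """
    rng = np.random.default_rng(seed)
    visits = np.zeros((SIZE, SIZE), dtype=int)
    total_return = 0.0
    for _ in range(K):
        state = (10, 10)           # start position
        for _ in range(H):
            visits[state] += 1
            if state == (16, 16):  # only non-zero reward in Setup 1
                total_return += 1.0
            state = step(state, policy[state], rng)
    return visits, total_return / K
```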