Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning
Authors: Dake Zhang, Boxiang Lyu, Shuang Qiu, Mladen Kolar, Tong Zhang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For completeness, we examine a variant of the Model Win MDP introduced in Thomas & Brunskill (2016) to verify theoretical findings. ... The suboptimality results are reported in Figure 1. We can see that with a larger K, the suboptimality goes to 0, which serves as simulation evidence for our algorithm. |
| Researcher Affiliation | Academia | ¹University of Chicago, IL, USA; ²Hong Kong University of Science and Technology, Hong Kong, China; ³University of Southern California, CA, USA; ⁴University of Illinois Urbana-Champaign, IL, USA |
| Pseudocode | Yes | Algorithm 1 RSPVI Algorithm ... Algorithm 2 VA-RSPVI Algorithm |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | For completeness, we examine a variant of the Model Win MDP introduced in Thomas & Brunskill (2016) to verify theoretical findings. |
| Dataset Splits | No | The paper describes the MDP environment and data generation process (offline dataset D consisting of K trajectories), but it does not specify explicit training, validation, and test dataset splits for the algorithms. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We evaluate the scenarios H = 5, 10, 15, 20 and β = 0.5, 1 in the experiment. ... The behavior policy used to generate the offline data selects a1 and a2 with equal probability. (A minimal simulation sketch of this setup follows the table.) |
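
The Pseudocode and Experiment Setup rows quote the RSPVI/VA-RSPVI algorithms and the simulation settings (H ∈ {5, 10, 15, 20}, β ∈ {0.5, 1}, K offline trajectories, and a uniform-random behavior policy over two actions). The sketch below is a hypothetical, tabular illustration of that kind of pipeline, not the paper's code: it assumes a toy two-state stand-in for the ModelWin-style environment, an entropic risk Bellman backup with risk parameter β, and a count-based pessimism bonus with an arbitrary scale; the paper's algorithm is stated for linear MDPs and its bonus term differs.

```python
# Hypothetical sketch of offline data generation and risk-sensitive pessimistic
# value iteration; the MDP tables, bonus scale, and clipping are assumptions.
import numpy as np

rng = np.random.default_rng(0)

H = 5          # horizon (the paper evaluates H = 5, 10, 15, 20)
beta = 0.5     # risk parameter (the paper evaluates beta = 0.5, 1)
K = 1000       # number of offline trajectories
nS, nA = 2, 2  # toy state/action space sizes (hypothetical)

# Hypothetical transition and reward tables standing in for the MDP variant.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])      # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 0.5]])        # R[s, a], deterministic rewards

# Offline dataset under the uniform-random behavior policy described in the
# paper (the two actions are chosen with equal probability).
data = []
for _ in range(K):
    s = 0
    for h in range(H):
        a = rng.integers(nA)                  # uniform over the two actions
        s_next = rng.choice(nS, p=P[s, a])
        data.append((h, s, a, R[s, a], s_next))
        s = s_next

# Empirical model and visit counts from the offline data.
counts = np.zeros((nS, nA))
P_hat = np.zeros((nS, nA, nS))
for h, s, a, r, s_next in data:
    counts[s, a] += 1
    P_hat[s, a, s_next] += 1
P_hat = np.where(counts[..., None] > 0,
                 P_hat / np.maximum(counts[..., None], 1),
                 1.0 / nS)                    # uniform fallback for unvisited pairs

# Backward induction with an entropic risk backup and a count-based pessimism
# penalty (a stand-in for the paper's bonus, which is built for linear MDPs).
c_bonus = 0.1                                 # hypothetical bonus scale
V = np.zeros((H + 1, nS))
pi = np.zeros((H, nS), dtype=int)
for h in reversed(range(H)):
    # Entropic risk of the next-step value: (1/beta) * log E[exp(beta * V)];
    # with deterministic rewards this is added to R[s, a] directly.
    risk_next = (1.0 / beta) * np.log(P_hat @ np.exp(beta * V[h + 1]))
    bonus = c_bonus / np.sqrt(np.maximum(counts, 1))
    Q = R + risk_next - bonus                 # pessimistic Q estimate
    pi[h] = Q.argmax(axis=1)
    V[h] = np.clip(Q.max(axis=1), 0.0, H - h) # keep values in a plausible range

print("Pessimistic value at the initial state:", V[0, 0])
```

Rerunning such a sketch with increasing K is the kind of check the Research Type row alludes to: as more offline trajectories are collected, the pessimistic value estimate and the induced policy's suboptimality should improve.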