Provably Efficient Partially Observable Risk-sensitive Reinforcement Learning with Hindsight Observation
Authors: Tonghe Zhang, Yu Chen, Longbo Huang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experimental results to evaluate the empirical performance of our algorithm, BVVI, which serves as a reference to validate our theoretical findings in Section 8. |
| Researcher Affiliation | Academia | 1Department of Electronic Engineering, Tsinghua University, Beijing, China 2Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. |
| Pseudocode | Yes | Algorithm 1 Beta Vector Value Iteration (BVVI) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The experiments use a custom-designed synthetic POMDP rather than any publicly available dataset: "We designed a POMDP model with two actions (A = 2), three states (S = 3) and three observations (O = 3). The length of horizon H is set to be 4 and the agent interacts with the POMDP for K = 2,000 episodes." |
| Dataset Splits | No | The paper describes interacting with a custom-designed POMDP environment for K = 2,000 episodes, but it does not specify traditional train/validation/test dataset splits needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | We designed a POMDP model with two actions (A = 2), three states (S = 3) and three observations (O = 3). The length of horizon H is set to be 4 and the agent interacts with the POMDP for K = 2,000 episodes. We chose confidence level δ = 0.2 and set the risk sensitivity parameter γ to 1.0. For simplicity, in the POMDP, the environment starts deterministically at state 1 and evolves in a time-homogeneous fashion. At each step, the environment transitions from state 1 to states 1, 2 and 3 with probabilities 0.03, 0.95, and 0.02, respectively. When starting from state 2, it transitions to states 1, 2, and 3 with probabilities 0.04, 0.02, and 0.94. The state-transition probabilities become 0.89, 0.10, and 0.01 when starting from state 3. In each state, the agent receives observations 1, 2 or 3 with probabilities 0.83, 0.08, and 0.09, respectively. The observation distributions become 0.05, 0.79, 0.06 or 0.02, 0.09, 0.89 when in states 2 or 3, respectively. The agent receives a reward of 1 when she takes action 1 in state 1, action 2 in state 2, or action 1 in state 3. In other cases, the agent gains a reward of 0. |
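The experiment-setup row fully specifies the toy environment, so the POMDP itself can be reconstructed. Below is a minimal sketch (not the authors' code) that encodes the quoted transition, observation, and reward tables in NumPy and rolls out episodes with a placeholder random policy instead of BVVI. Assumptions: transitions are treated as action-independent because the quoted text gives only per-state rows, and the state-2 observation row (0.05, 0.79, 0.06) is renormalized since, as quoted, it sums to 0.90.

```python
import numpy as np

# Sketch of the toy POMDP from the Experiment Setup row: S = 3 states,
# A = 2 actions, O = 3 observations, horizon H = 4, K = 2,000 episodes,
# deterministic start in state 1.
S, A, O, H, K = 3, 2, 3, 4, 2000
rng = np.random.default_rng(0)

# Time-homogeneous transition matrix T[s, s'] = P(s' | s). Action dependence
# is not specified in the quoted setup, so one matrix is assumed for both actions.
T = np.array([
    [0.03, 0.95, 0.02],   # from state 1
    [0.04, 0.02, 0.94],   # from state 2
    [0.89, 0.10, 0.01],   # from state 3
])

# Observation matrix Z[s, o] = P(o | s). The state-2 row as quoted sums to
# 0.90, so all rows are renormalized before sampling.
Z = np.array([
    [0.83, 0.08, 0.09],   # in state 1
    [0.05, 0.79, 0.06],   # in state 2 (renormalized below)
    [0.02, 0.09, 0.89],   # in state 3
])
Z = Z / Z.sum(axis=1, keepdims=True)

# Reward R[s, a]: 1 for (state 1, action 1), (state 2, action 2),
# (state 3, action 1); 0 otherwise.
R = np.zeros((S, A))
R[0, 0] = R[1, 1] = R[2, 0] = 1.0


def run_episode(policy):
    """Roll out one H-step episode; `policy` maps (step, observation) -> action."""
    s = 0                                  # environment starts in state 1
    total_reward = 0.0
    for h in range(H):
        o = rng.choice(O, p=Z[s])          # emit an observation from the current state
        a = policy(h, o)
        total_reward += R[s, a]
        s = rng.choice(S, p=T[s])          # time-homogeneous state transition
    return total_reward


# Example interaction over K episodes with a uniformly random policy
# (a stand-in baseline; the paper's agent is BVVI).
returns = [run_episode(lambda h, o: int(rng.integers(A))) for _ in range(K)]
print(f"mean return over {K} episodes: {np.mean(returns):.3f}")
```

This reconstruction only fixes the environment; reproducing the paper's results would additionally require implementing Algorithm 1 (BVVI) with the reported δ = 0.2 and γ = 1.0.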