Provably Efficient Partially Observable Risk-sensitive Reinforcement Learning with Hindsight Observation

Authors: Tonghe Zhang, Yu Chen, Longbo Huang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present experimental results to evaluate the empirical performance of our algorithm, BVVI, which serves as a reference to validate our theoretical findings in Section 8.
Researcher Affiliation | Academia | (1) Department of Electronic Engineering, Tsinghua University, Beijing, China; (2) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China.
Pseudocode | Yes | Algorithm 1 Beta Vector Value Iteration (BVVI)
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology is openly available.
Open Datasets | No | We designed a POMDP model with two actions (A = 2), three states (S = 3) and three observations (O = 3). The length of horizon H is set to be 4 and the agent interacts with the POMDP for K = 2,000 episodes.
Dataset Splits | No | The paper describes interacting with a custom-designed POMDP environment for K = 2,000 episodes, but it does not specify the explicit train/validation/test splits that a traditional reproduction would require.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | We designed a POMDP model with two actions (A = 2), three states (S = 3) and three observations (O = 3). The length of horizon H is set to be 4 and the agent interacts with the POMDP for K = 2,000 episodes. We chose confidence level δ = 0.2 and set the risk-sensitivity parameter γ to 1.0. For simplicity, in the POMDP, the environment starts deterministically at state 1 and evolves in a time-homogeneous fashion. At each step, the environment transitions from state 1 to states 1, 2 and 3 with probabilities 0.03, 0.95, and 0.02, respectively. When starting from state 2, it transitions to states 1, 2, and 3 with probabilities 0.04, 0.02, and 0.94. The state-transition probabilities become 0.89, 0.10, and 0.01 when starting from state 3. In each state, the agent receives observations 1, 2 or 3; the probabilities are 0.83, 0.08, and 0.09 in state 1, and become 0.05, 0.79, 0.06 or 0.02, 0.09, 0.89 in states 2 or 3, respectively. The agent receives a reward of 1 when she takes action 1 in state 1, action 2 in state 2, or action 1 in state 3. In other cases, the agent gains a reward of 0.
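To make the quoted experiment setup concrete, below is a minimal Python sketch that instantiates the described tabular POMDP and estimates the entropic-risk value of a uniformly random policy over K episodes. Everything beyond the quoted numbers is an assumption: transitions and observations are taken to be action-independent (the excerpt does not say otherwise), states/actions/observations are 0-indexed, the observation row for state 2 is renormalized because the quoted probabilities sum to 0.90, and the risk objective shown is the entropic (exponential-utility) risk commonly used in risk-sensitive RL. The sketch does not implement the paper's BVVI algorithm or its hindsight-observation mechanism.

```python
import numpy as np

# Dimensions and experiment parameters quoted in the Experiment Setup row.
S, A, O, H, K = 3, 2, 3, 4, 2000   # states, actions, observations, horizon, episodes
delta, gamma = 0.2, 1.0            # confidence level, risk-sensitivity parameter

mu0 = np.array([1.0, 0.0, 0.0])    # environment starts deterministically at state 1

# Time-homogeneous transition matrix T[s, s'] = P(s' | s)
# (assumed action-independent; the excerpt does not condition on actions).
T = np.array([
    [0.03, 0.95, 0.02],
    [0.04, 0.02, 0.94],
    [0.89, 0.10, 0.01],
])

# Observation matrix Z[s, o] = P(o | s). The quoted row for state 2 sums to
# 0.90, so all rows are renormalized here -- an assumption on our part.
Z = np.array([
    [0.83, 0.08, 0.09],
    [0.05, 0.79, 0.06],
    [0.02, 0.09, 0.89],
])
Z = Z / Z.sum(axis=1, keepdims=True)

# Reward r[s, a]: 1 for (state 1, action 1), (state 2, action 2), (state 3, action 1).
r = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
])

rng = np.random.default_rng(0)

def rollout(policy):
    """Simulate one H-step episode under `policy(history) -> action`."""
    s = rng.choice(S, p=mu0)
    history, total = [], 0.0
    for _ in range(H):
        o = rng.choice(O, p=Z[s])
        a = policy(history + [o])
        total += r[s, a]
        history += [o, a]
        s = rng.choice(S, p=T[s])
    return total

# Example: K episodes under a uniformly random policy, then the empirical
# entropic risk of the return, (1/gamma) * log E[exp(gamma * return)].
returns = np.array([rollout(lambda h: rng.integers(A)) for _ in range(K)])
risk_value = np.log(np.mean(np.exp(gamma * returns))) / gamma
print(f"empirical entropic risk of a random policy: {risk_value:.3f}")
```

Swapping the random policy for any history-dependent policy reuses the same loop and reproduces the K = 2,000-episode interaction budget quoted above; the hypothetical `rollout` helper and the random-policy baseline are illustrative only.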