Model-based Reinforcement Learning for Confounded POMDPs
Authors: Mao Hong, Zhengling Qi, Yanxun Xu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a model-based offline reinforcement learning (RL) algorithm for confounded partially observable Markov decision processes (POMDPs) under general function approximations and show it is provably efficient under some technical conditions such as the partial coverage imposed on the offline data distribution. Specifically, we first establish a novel model-based identification result for learning the effect of any action on the reward and future transitions in the confounded POMDP. Using this identification result, we then design a nonparametric two-stage estimation procedure to construct an estimator for off-policy evaluation (OPE), which permits general function approximations. Finally, we learn the optimal policy by performing a conservative policy optimization within the confidence regions based on the proposed estimation procedure for OPE. Under some mild conditions, we establish a finite-sample upper bound on the suboptimality of the learned policy in finding the optimal one, which depends on the sample size and the length of horizons polynomially. ... In particular, since RKHS can be employed for modeling bridge functions, the bridge functions can be expressed as linear combinations of many feature functions, making the ERM a quadratic function with respect to the coefficients associated with the bridge functions. As a result, the estimators of the bridge functions will have closed forms, making them computationally tractable and applicable to subsequent tasks. To perform conservative policy optimization, the idea of an existing work that designed a practical pessimistic model-based algorithm in standard MDP contexts (Rigter et al., 2022) could be potentially adapted to our confounded POMDP settings. Second, in this paper, we focus on the case when the bridge functions are realizable (Assumption 4.1(d)), the estimated conditional density functions at stage 1 are consistent (Assumption 4.1(b)), and the empirical conditional mean operator at stage 1 is consistent (Assumption 4.1(c)). In other words, all the required function spaces for the bridge functions, conditional density functions, and conditional mean operators are sufficiently large so that there is no approximation error occurring in this work. It would be interesting to relax these assumptions and allow for approximation error. (Illustrative sketches of the closed-form RKHS step and of the conservative policy optimization appear below the table.) |
| Researcher Affiliation | Academia | (1) Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, United States; (2) Department of Decision Sciences, George Washington University, Washington, DC, United States. |
| Pseudocode | Yes | We summarize the proposed algorithm in Algorithm 1. ... Algorithm 1 Conservative model-based policy optimization for POMDPs (see the pessimistic-selection sketch below the table). |
| Open Source Code | No | The paper does not include any statement or link indicating that the source code for the described methodology is publicly available. It states in the discussion: 'it would be intriguing to design a practical algorithm with further empirical evaluation to demonstrate the practical effectiveness of the proposed method.' |
| Open Datasets | No | The paper is theoretical and does not use a specific, publicly available dataset for experiments. It refers to 'a pre-collected dataset' in the offline setting but provides no access information for such a dataset. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments with training, test, or validation dataset splits. The term 'validation' is used in the context of theoretical validity ('demonstrate the validity of the proposed algorithm'). |
| Hardware Specification | No | The paper is theoretical and does not describe any hardware used for experiments. |
| Software Dependencies | No | The paper mentions mathematical concepts like 'RKHS endowed kernel ridge regressions' but does not specify any software names with version numbers or other software dependencies. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training details. |
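
The Research Type row quotes the paper's observation that, once the bridge functions are modeled in an RKHS, the stage-2 empirical risk is quadratic in the coefficients and therefore admits a closed-form minimizer. The snippet below is a minimal kernel-ridge sketch of that generic closed-form step only, under assumed placeholder inputs (a Gaussian kernel and made-up stage-1 pseudo-targets); it is not the paper's actual estimating equation or loss.

```python
import numpy as np

def rbf_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_ridge_closed_form(K, y, lam):
    """Closed-form minimizer of the quadratic objective
        (1/n) * ||y - K @ alpha||^2 + lam * alpha @ K @ alpha,
    i.e. ordinary kernel ridge regression; the fitted function is
        f(x) = sum_i alpha_i * k(x_i, x).
    """
    n = K.shape[0]
    # Quadratic in alpha, so the normal equations yield the estimator directly.
    return np.linalg.solve(K + n * lam * np.eye(n), y)

# Toy usage (all quantities hypothetical): `pseudo_targets` stands in for
# whatever the stage-1 conditional-mean/density estimates would produce, and
# `X` for the observed features; the bridge-function fit is then closed-form.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
pseudo_targets = rng.normal(size=200)
K = rbf_kernel(X, X)
alpha = kernel_ridge_closed_form(K, pseudo_targets, lam=1e-2)
bridge_values_at_new_points = rbf_kernel(rng.normal(size=(5, 3)), X) @ alpha
```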
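
Similarly, the Pseudocode row names Algorithm 1, "Conservative model-based policy optimization for POMDPs", which the quoted abstract describes as conservative policy optimization within confidence regions built from the OPE estimator. The sketch below only illustrates the generic pessimistic (max-min) selection rule over a finite confidence set; the policy class, the confidence-set construction, and the `ope_value` callable are hypothetical stand-ins, not the paper's Algorithm 1.

```python
def conservative_policy_selection(candidate_policies, confidence_set, ope_value):
    """Return the maximizer of the worst-case (pessimistic) value:
        pi_hat = argmax_pi  min_{M in confidence_set}  ope_value(pi, M).

    candidate_policies : iterable of policy objects (hypothetical).
    confidence_set     : iterable of models the offline data cannot rule out,
                         e.g. models whose estimation loss is below a threshold.
    ope_value          : callable (policy, model) -> estimated value, standing
                         in for an off-policy evaluation estimator.
    """
    best_policy, best_value = None, float("-inf")
    for pi in candidate_policies:
        # Pessimism: score each policy under its least favourable plausible model.
        pessimistic_value = min(ope_value(pi, M) for M in confidence_set)
        if pessimistic_value > best_value:
            best_policy, best_value = pi, pessimistic_value
    return best_policy, best_value

# Toy usage with placeholder labels: two policies, three plausible models,
# and a made-up value table standing in for the OPE estimates.
values = {("pi1", "M1"): 1.0, ("pi1", "M2"): 0.2, ("pi1", "M3"): 0.9,
          ("pi2", "M1"): 1.5, ("pi2", "M2"): 0.1, ("pi2", "M3"): 1.4}
pi_hat, v_hat = conservative_policy_selection(
    ["pi1", "pi2"], ["M1", "M2", "M3"], lambda pi, M: values[(pi, M)])
# "pi1" wins: its worst-case value 0.2 exceeds "pi2"'s worst case of 0.1.
```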