Model-based Reinforcement Learning for Confounded POMDPs

Authors: Mao Hong, Zhengling Qi, Yanxun Xu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose a model-based offline reinforcement learning (RL) algorithm for confounded partially observable Markov decision processes (POMDPs) under general function approximations and show it is provably efficient under some technical conditions such as the partial coverage imposed on the offline data distribution. Specifically, we first establish a novel model-based identification result for learning the effect of any action on the reward and future transitions in the confounded POMDP. Using this identification result, we then design a nonparametric two-stage estimation procedure to construct an estimator for off-policy evaluation (OPE), which permits general function approximations. Finally, we learn the optimal policy by performing a conservative policy optimization within the confidence regions based on the proposed estimation procedure for OPE. Under some mild conditions, we establish a finite-sample upper bound on the suboptimality of the learned policy in finding the optimal one, which depends on the sample size and the length of horizons polynomially. ... In particular, since RKHS can be employed for modeling bridge functions, the bridge functions can be expressed as linear combinations of many feature functions, making the ERM a quadratic function with respect to the coefficients associated with the bridge functions. As a result, the estimators of the bridge functions will have closed forms, making them computationally tractable and applicable to subsequent tasks. To perform conservative policy optimization, the idea of an existing work that designed a practical pessimistic model-based algorithm in standard MDP contexts (Rigter et al., 2022) could be potentially adapted to our confounded POMDP settings. Second, in this paper, we focus on the case when the bridge functions are realizable (Assumption 4.1(d)), the estimated conditional density functions at stage 1 are consistent (Assumption 4.1(b)), and the empirical conditional mean operator at stage 1 is consistent (Assumption 4.1(c)). In other words, all the required function spaces for the bridge functions, conditional density functions, and conditional mean operators are sufficiently large so that there is no approximation error occurring in this work. It would be interesting to relax these assumptions and allow for approximation error. [See the RKHS sketch after the table.]
Researcher Affiliation | Academia | (1) Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, United States; (2) Department of Decision Sciences, George Washington University, Washington, DC, United States.
Pseudocode | Yes | We summarize the proposed algorithm in Algorithm 1. ... Algorithm 1 Conservative model-based policy optimization for POMDPs. [See the conservative max-min sketch after the table.]
Open Source Code | No | The paper does not include any statement or link indicating that the source code for the described methodology is publicly available. It states in the discussion: 'it would be intriguing to design a practical algorithm with further empirical evaluation to demonstrate the practical effectiveness of the proposed method.'
Open Datasets | No | The paper is theoretical and does not use a specific, publicly available dataset for experiments. It refers to 'a pre-collected dataset' in the offline setting but provides no access information for such a dataset.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments with training, test, or validation dataset splits. The term 'validation' is used in the context of theoretical validity ('demonstrate the validity of the proposed algorithm').
Hardware Specification | No | The paper is theoretical and does not describe any hardware used for experiments.
Software Dependencies | No | The paper mentions mathematical concepts like 'RKHS endowed kernel ridge regressions' but does not specify any software names with version numbers or other software dependencies.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training details.
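
The Research Type response above quotes the paper's observation that modeling the bridge functions in an RKHS makes the empirical risk minimization (ERM) quadratic in the coefficients, so the estimators have closed forms. The sketch below illustrates only that generic mechanism with a single ridge-regularized squared-loss fit via the representer theorem; it is not the paper's two-stage conditional-moment estimator, and the Gaussian kernel, the bandwidth and regularization values, and the names rbf_kernel / fit_rkhs_bridge are illustrative assumptions rather than objects defined in the paper.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel between the rows of A and the rows of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def fit_rkhs_bridge(X, targets, reg=1e-2, bandwidth=1.0):
    # Representer theorem: the minimizer can be written as
    # h(x) = sum_i alpha_i k(x_i, x), so the ridge-regularized squared-loss
    # ERM is quadratic in alpha and alpha solves a single linear system.
    n = X.shape[0]
    K = rbf_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + n * reg * np.eye(n), targets)
    return lambda x_new: rbf_kernel(np.atleast_2d(x_new), X, bandwidth) @ alpha

# Toy usage: the fit is closed form, with no iterative optimization.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
h_hat = fit_rkhs_bridge(X, y)
print(h_hat(X[:5]))  # fitted values at the first five sample points
```

The closed form is what the quoted passage means by "computationally tractable": the only cost is solving an n-by-n linear system in the coefficients.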
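
The Pseudocode row references Algorithm 1, "Conservative model-based policy optimization for POMDPs". The snippet below is a hedged schematic of the max-min pattern that "conservative policy optimization within the confidence regions" suggests, simplified to a finite policy class and an explicitly enumerated confidence set; the arguments policies, candidate_models, and ope_value are placeholders, not quantities defined in the paper.

```python
import numpy as np

def conservative_policy_selection(policies, candidate_models, ope_value):
    # Score each policy by its worst-case estimated value over the models
    # retained in the confidence region, then return the policy whose
    # pessimistic (worst-case) value is largest.
    best_policy, best_value = None, -np.inf
    for pi in policies:
        pessimistic_value = min(ope_value(pi, model) for model in candidate_models)
        if pessimistic_value > best_value:
            best_policy, best_value = pi, pessimistic_value
    return best_policy, best_value
```

In the paper the confidence region is built from the two-stage OPE estimator rather than enumerated explicitly, so only the max-min structure carries over from this sketch.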