Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Breaking the Order Barrier: Off-Policy Evaluation for Confounded POMDPs

Authors: Qi Kuang, Jiayi Wang, Fan Zhou, Zhengling Qi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 6 (Simulation Results): We conduct a simulation study to examine the behavior of the error $|\hat{V}(\pi) - V(\pi)|$ with respect to sample size $n$ and horizon $T$. The primary objective is to provide empirical validation for our theoretical results. To this end, we use a relatively simple simulation setup that ensures clarity in demonstration. Specifically, we evaluate our approach in a simulated POMDP environment characterized by a one-dimensional discrete state/observation space, a discrete reward space, and binary actions. Concretely, we set $\mathcal{A} = \{0, 1\}$, $\mathcal{S} = \mathcal{O} = \{0, 1, 2\}$ for all $t$. The initial observation is given by $O_0 \sim \mathrm{Unif}(\{0, 1, 2\})$ and $S_t \sim \mathrm{Unif}(\{0, 1, 2\})$, and the transition dynamic is given by $O_t \sim P_t(\cdot \mid S_t)$, where $P_t(O_t \mid S_t) = \mathbb{1}\{O_t = S_t\}(1 - 3\epsilon/2) + \epsilon/2$. The immediate reward is set to be Rt = 2/{1 + exp( 2St At 3) 1}. We collected offline data using a time-homogeneous behavioral policy $\pi^b_t(1 \mid S_t) = 1/\{1 + \exp(-0.6 S_t + 1)\} = 1 - \pi^b_t(0 \mid S_t)$. For experimental details, we set $\epsilon = 0.2$, initialize $\hat{b}_{V,T+1} = 0$, estimate conditional probability matrices as described in (4), and iteratively compute the value function $\hat{b}_{V,t}$ over $T$ steps according to (5). We evaluate two target policies. (1) For the memoryless target policy $\pi_t(1 \mid O_t) = 1/\{1 + \exp(-0.8 O_t + 1)\} = 1 - \pi_t(0 \mid O_t)$, the conditional probabilities $P_{a_t}$ and $P_{a_t, r_t, o_{t+1}}$ are conditioned on $O_0$. We evaluate its value using sample sizes $n = 200, 400, \ldots, 1000$ and horizon lengths $T = 20, 60, 100, 140$. The results, shown in Figure 2(a), reveal a nearly linear relationship between $|\hat{V}(\pi) - V(\pi)|$ and $T$, which aligns with our theoretical results as shown in Theorem 3. (2) For the fully history-dependent target policy setting, to simplify computation, we fix the action space to a single action, $\mathcal{A} = \{1\}$. In this case, $\pi_t(1 \mid O_t, H_{t-1}) = 1$ and the historical information $H_{t-1}$ is only used to estimate the conditional probability matrices.
We evaluate the policy value using sample sizes $n = 1000, 4000, 7000, 10000$ and horizon lengths $T = 2, 4, 6$. Figure 2(b) presents the logarithm of $|\hat{V}(\pi) - V(\pi)|$ versus $T$. For the $n = 1000$ setting, we observe noticeable fluctuations due to the increased size of the conditional probability matrices as $T$ grows, which requires more samples to estimate each entry accurately. Nonetheless, across different $n$, we observe an approximately linear relationship between the logarithmic error and the horizon $T$. These experimental results are consistent with the theoretical findings presented in Theorem 3.
Researcher Affiliation | Academia | Qi Kuang (School of Statistics and Data Science, Jiangxi University of Finance and Economics); Jiayi Wang (Department of Mathematical Sciences, University of Texas at Dallas); Fan Zhou (School of Statistics and Data Science and MoE Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics); Zhengling Qi (School of Business, George Washington University). Corresponding authors: EMAIL; EMAIL
Pseudocode | Yes | Algorithm 1: Tabular Off-Policy Evaluation for Confounded POMDPs
  Input: Dataset $\mathcal{D}$, the target policy $\{\pi_t\}_{t=1}^T$, and initialize $\hat{b}_{V,T+1} = 0$.
  for $t = T, \ldots, 1$ do
    Estimation of conditional probability: obtain $\hat{P}_{a_t}$ and $\hat{P}_{a_t, r_t, o_{t+1}}$ by (4)
    Estimation of value functions: obtain $\hat{b}_{V,t}$ by (5)
  end for
  Output: obtain estimated policy value $\hat{V}(\pi)$ by $\hat{b}_{V,1}$.
Open Source Code | Yes | The code to implement the simulation is available at https://github.com/kuangqi927/Confoundedpomdp.
Open Datasets | No | We conduct a simulation study to examine the behavior of the error $|\hat{V}(\pi) - V(\pi)|$ with respect to sample size $n$ and horizon $T$. The primary objective is to provide empirical validation for our theoretical results. To this end, we use a relatively simple simulation setup that ensures clarity in demonstration. Specifically, we evaluate our approach in a simulated POMDP environment characterized by a one-dimensional discrete state/observation space, a discrete reward space, and binary actions. Concretely, we set $\mathcal{A} = \{0, 1\}$, $\mathcal{S} = \mathcal{O} = \{0, 1, 2\}$ for all $t$. The initial observation is given by $O_0 \sim \mathrm{Unif}(\{0, 1, 2\})$ and $S_t \sim \mathrm{Unif}(\{0, 1, 2\})$, and the transition dynamic is given by $O_t \sim P_t(\cdot \mid S_t)$, where $P_t(O_t \mid S_t) = \mathbb{1}\{O_t = S_t\}(1 - 3\epsilon/2) + \epsilon/2$. The immediate reward is set to be Rt = 2/{1 + exp( 2St At 3) 1}. We collected offline data using a time-homogeneous behavioral policy $\pi^b_t(1 \mid S_t) = 1/\{1 + \exp(-0.6 S_t + 1)\} = 1 - \pi^b_t(0 \mid S_t)$.
Dataset Splits | No | We collected offline data using a time-homogeneous behavioral policy $\{\pi^b_t\}_{t=1}^T$... $\mathcal{D} := \{o_0^i, (o_t^i, a_t^i, r_t^i)_{t=1}^T\}_{i=1}^n$, which consists of $n$ i.i.d. samples drawn from P. We evaluate its value using sample sizes $n = 200, 400, \ldots, 1000$ and horizon lengths $T = 20, 60, 100, 140$. The paper describes a simulation to generate data and then uses various sample sizes, but it does not specify the explicit training/validation/test splits typically used for dataset evaluation.
Hardware Specification | Yes | All simulations were performed on an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz.
Software Dependencies | No | The paper does not provide specific software names with version numbers for libraries or frameworks used in the implementation. It only mentions the availability of code for simulation.
Experiment Setup | Yes | For experimental details, we set $\epsilon = 0.2$, initialize $\hat{b}_{V,T+1} = 0$, estimate conditional probability matrices as described in (4), and iteratively compute the value function $\hat{b}_{V,t}$ over $T$ steps according to (5). Specifically, we evaluate our approach in a simulated POMDP environment characterized by a one-dimensional discrete state/observation space, a discrete reward space, and binary actions. Concretely, we set $\mathcal{A} = \{0, 1\}$, $\mathcal{S} = \mathcal{O} = \{0, 1, 2\}$ for all $t$. The initial observation is given by $O_0 \sim \mathrm{Unif}(\{0, 1, 2\})$ and $S_t \sim \mathrm{Unif}(\{0, 1, 2\})$, and the transition dynamic is given by $O_t \sim P_t(\cdot \mid S_t)$, where $P_t(O_t \mid S_t) = \mathbb{1}\{O_t = S_t\}(1 - 3\epsilon/2) + \epsilon/2$. The immediate reward is set to be Rt = 2/{1 + exp( 2St At 3) 1}. We collected offline data using a time-homogeneous behavioral policy $\pi^b_t(1 \mid S_t) = 1/\{1 + \exp(-0.6 S_t + 1)\} = 1 - \pi^b_t(0 \mid S_t)$.
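
The data-generating process quoted in the Experiment Setup row can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation (their code is at the repository linked above). The function names are hypothetical, the minus sign in the behavioral-policy exponent is assumed because the sign is missing from the extracted text, and the reward is passed in as a caller-supplied function because the reward formula is garbled on this page; `placeholder_reward` is a stand-in only.

```python
import math
import random

EPS = 0.2            # noise level epsilon reported in the setup
STATES = (0, 1, 2)   # S = O = {0, 1, 2}


def emission_probs(s, eps=EPS):
    """P(O_t = o | S_t = s) = 1{o = s} * (1 - 3*eps/2) + eps/2, for o in {0, 1, 2}."""
    return [(1.0 - 1.5 * eps if o == s else 0.0) + 0.5 * eps for o in STATES]


def behavior_prob(s):
    """pi^b(1 | s) = 1 / (1 + exp(-0.6*s + 1)).

    Assumption: the minus sign before 0.6 is reconstructed, not confirmed.
    """
    return 1.0 / (1.0 + math.exp(-0.6 * s + 1.0))


def sample_trajectory(T, reward_fn, rng):
    """Draw O_0 and (O_t, A_t, R_t) for t = 1..T; the latent S_t is not recorded."""
    o0 = rng.choice(STATES)                                    # O_0 ~ Unif({0, 1, 2})
    steps = []
    for _ in range(T):
        s = rng.choice(STATES)                                 # S_t ~ Unif({0, 1, 2})
        o = rng.choices(STATES, weights=emission_probs(s))[0]  # O_t ~ P_t(. | S_t)
        a = 1 if rng.random() < behavior_prob(s) else 0        # A_t ~ pi^b(. | S_t)
        steps.append((o, a, reward_fn(s, a)))
    return o0, steps


def placeholder_reward(s, a):
    """Hypothetical stand-in; NOT the paper's reward function."""
    return float(s + a)


rng = random.Random(0)
dataset = [sample_trajectory(20, placeholder_reward, rng) for _ in range(200)]
```

Each row of the emission distribution sums to one because the diagonal entry is $1 - 3\epsilon/2 + \epsilon/2 = 1 - \epsilon$ and the two off-diagonal entries are $\epsilon/2$ each. Because the action depends on the unrecorded state $S_t$, the resulting offline data is confounded, which is the setting the paper studies.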