Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
Authors: Yuheng Zhang, Nan Jiang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm whose analysis has surprisingly eluded the literature despite the algorithm's simplicity, demonstrating a provable separation between model-free and model-based OPE in POMDPs. |
| Researcher Affiliation | Academia | Yuheng Zhang & Nan Jiang University of Illinois Urbana-Champaign EMAIL |
| Pseudocode | No | Concretely, we consider a very natural algorithm based on Maximum Likelihood Estimation (MLE). It is, in fact, quite surprising that such a simple algorithm has not been analyzed under relatively general assumptions for OPE in POMDPs. The algorithm is as follows. Pre-filtering: we construct $\widetilde{\mathcal{M}} \subseteq \mathcal{M}$, where all models that violate Assumption E (for Section 4.1, where we assume single-step revealing in the guarantee) or Assumption D (for Section 4.2, where we assume multi-step revealing) are excluded from $\mathcal{M}$. Such a filtering step is common in the POMDP learning literature (Liu et al., 2022a), and in Section 5 we show why MLE estimation requires this as a pre-processing step and will fail otherwise. MLE: let $\mathbb{P}^{\pi}_{M}$ stand for the probabilities under policy $\pi$ and model $M$; the model is learned as $\widehat{M} = \arg\max_{M \in \widetilde{\mathcal{M}}} \sum_{i=1}^{n} \log \mathbb{P}^{\pi_b}_{M}(\tau^{(i)}_H)$ (1). Prediction: we use the expected return of $\pi_e$ in $\widehat{M}$, $J_{\widehat{M}}(\pi_e)$, as our estimate of $J(\pi_e)$. While the paper describes the steps of an algorithm, it does not present them in a structured pseudocode block with numbered steps or code-like formatting. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical, focusing on proving information-theoretic hardness and deriving polynomial sample-complexity bounds for off-policy evaluation in POMDPs. It does not describe experiments performed on specific datasets, nor does it provide access information for any open datasets. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical studies using datasets, therefore, it does not describe any dataset splits (training, validation, or test). |
| Hardware Specification | No | The paper is theoretical and focuses on mathematical proofs and algorithms for off-policy evaluation in POMDPs. It does not describe any experimental setup that would require specific hardware, thus no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and focuses on mathematical proofs and algorithms. It does not describe any experimental implementation or software dependencies with specific version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on information-theoretic hardness and model-based algorithms for off-policy evaluation. It does not describe any empirical experiments, and thus no specific experimental setup details like hyperparameters or training configurations are provided. |
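The three-step procedure quoted in the Pseudocode row (pre-filtering, MLE over the filtered model class, then prediction under the fitted model) can be sketched in Python. This is a minimal illustrative sketch, not code from the paper: `ToyModel`, its fields, and the tabulated trajectory likelihoods are all hypothetical stand-ins for the paper's abstract model class $\mathcal{M}$ and assumption checks.

```python
import math
from dataclasses import dataclass

@dataclass
class ToyModel:
    """Illustrative stand-in for a candidate POMDP model M (hypothetical).
    `probs` maps a trajectory id to its likelihood under the behavior policy;
    `returns` maps a policy name to the model's expected return;
    `revealing` flags whether the model satisfies the revealing assumption."""
    name: str
    probs: dict
    returns: dict
    revealing: bool

def mle_ope(models, trajectories, eval_policy):
    # Step 1: pre-filtering -- exclude models violating the revealing assumption
    filtered = [M for M in models if M.revealing]
    # Step 2: MLE -- pick the model maximizing the sum of trajectory log-likelihoods
    best = max(filtered,
               key=lambda M: sum(math.log(M.probs[tau]) for tau in trajectories))
    # Step 3: prediction -- report the evaluation policy's return in the fitted model
    return best.returns[eval_policy]

# Usage: model B fits the observed trajectories best among revealing models
models = [
    ToyModel("A", {"t1": 0.1, "t2": 0.1}, {"pi_e": 0.0}, True),
    ToyModel("B", {"t1": 0.4, "t2": 0.5}, {"pi_e": 1.0}, True),
    ToyModel("C", {"t1": 0.9, "t2": 0.9}, {"pi_e": 2.0}, False),  # filtered out
]
print(mle_ope(models, ["t1", "t2"], "pi_e"))  # -> 1.0
```

The pre-filtering step matters here: model C has the highest likelihood but is discarded, matching the paper's point that MLE without the filtering pre-processing step can fail.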