Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs

Authors: Yuheng Zhang, Nan Jiang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm whose analysis has surprisingly eluded the literature despite the algorithm's simplicity, demonstrating a provable separation between model-free and model-based OPE in POMDPs.
Researcher Affiliation | Academia | Yuheng Zhang & Nan Jiang, University of Illinois Urbana-Champaign
Pseudocode | No | Concretely, we consider a very natural algorithm based on Maximum Likelihood Estimation (MLE). It is, in fact, quite surprising that such a simple algorithm has not been analyzed under relatively general assumptions for OPE in POMDPs. The algorithm is as follows:
Pre-filtering: We construct $\mathcal{M}' \subseteq \mathcal{M}$ where all models that violate Assumption E (for Section 4.1, where we assume single-step revealing in the guarantee) or Assumption D (for Section 4.2, where we assume multi-step revealing) are excluded from $\mathcal{M}$. Such a filtering step is common in the POMDP learning literature (Liu et al., 2022a), and in Section 5 we show why MLE estimation requires this as a pre-processing step and will fail otherwise.
MLE: Let $\mathbb{P}^{\pi}_{M}$ denote the probabilities under policy $\pi$ and model $M$. The model is learned as
$$\widehat{M} = \arg\max_{M \in \mathcal{M}'} \sum_{i=1}^{n} \log \mathbb{P}^{\pi_b}_{M}\big(\tau^{(i)}_{H}\big). \tag{1}$$
Prediction: We use the expected return of $\pi_e$ in $\widehat{M}$, $J_{\widehat{M}}(\pi_e)$, as our estimate of $J(\pi_e)$.
While the paper describes the steps of an algorithm, it does not present them in a structured pseudocode block with numbered steps or code-like formatting.
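The three steps quoted above (pre-filtering, MLE over the surviving candidates, then prediction) can be illustrated with a minimal Python sketch over a toy finite model class. All names here (`CandidateModel`, `mle_ope`, the `revealing` flag standing in for Assumption E/D) are hypothetical illustrations, not from the paper, which works with general model classes rather than finite lookup tables.

```python
import math

class CandidateModel:
    """Toy stand-in for a POMDP model M: it assigns a probability to each
    behavior-policy trajectory and an expected return J_M(pi_e)."""
    def __init__(self, name, traj_probs, eval_return, revealing=True):
        self.name = name
        self.traj_probs = traj_probs      # P^{pi_b}_M(tau) keyed by trajectory id
        self.eval_return = eval_return    # J_M(pi_e)
        self.revealing = revealing        # proxy for satisfying Assumption E/D

def mle_ope(candidates, dataset):
    # Pre-filtering: drop models violating the revealing assumption.
    filtered = [m for m in candidates if m.revealing]
    # MLE (Eq. 1): pick the model maximizing the sum of trajectory log-likelihoods.
    def log_likelihood(m):
        return sum(math.log(m.traj_probs[tau]) for tau in dataset)
    m_hat = max(filtered, key=log_likelihood)
    # Prediction: report the expected return of pi_e under the learned model.
    return m_hat.eval_return, m_hat.name

# Usage: three candidates, two trajectories (ids 0 and 1) in the dataset.
m1 = CandidateModel("M1", {0: 0.7, 1: 0.3}, eval_return=1.0)
m2 = CandidateModel("M2", {0: 0.2, 1: 0.8}, eval_return=0.5)
m3 = CandidateModel("M3", {0: 0.9, 1: 0.1}, eval_return=2.0, revealing=False)
estimate, chosen = mle_ope([m1, m2, m3], dataset=[0, 0, 1])
# M3 is filtered out despite its high likelihood; M1 wins among the rest.
```

Note that the pre-filtering step changes the answer here: without it, `M3` would be selected by raw likelihood, which mirrors the paper's point that MLE requires the filtering as a pre-processing step and will fail otherwise.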
Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical, focusing on proving information-theoretic hardness and deriving polynomial sample-complexity bounds for off-policy evaluation in POMDPs. It does not describe experiments performed on specific datasets, nor does it provide access information for any open datasets.
Dataset Splits | No | The paper is theoretical and does not conduct empirical studies using datasets; therefore, it does not describe any dataset splits (training, validation, or test).
Hardware Specification | No | The paper is theoretical and focuses on mathematical proofs and algorithms for off-policy evaluation in POMDPs. It does not describe any experimental setup that would require specific hardware, thus no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and focuses on mathematical proofs and algorithms. It does not describe any experimental implementation or software dependencies with specific version numbers.
Experiment Setup | No | The paper is theoretical and focuses on information-theoretic hardness and model-based algorithms for off-policy evaluation. It does not describe any empirical experiments, and thus no specific experimental setup details like hyperparameters or training configurations are provided.