Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
Authors: Yuheng Zhang, Nan Jiang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm whose analysis has surprisingly eluded the literature despite the algorithm's simplicity, demonstrating a provable separation between model-free and model-based OPE in POMDPs. |
| Researcher Affiliation | Academia | Yuheng Zhang & Nan Jiang University of Illinois Urbana-Champaign EMAIL |
| Pseudocode | No | Concretely, we consider a very natural algorithm based on Maximum Likelihood Estimation (MLE). It is, in fact, quite surprising that such a simple algorithm has not been analyzed under relatively general assumptions for OPE in POMDPs. The algorithm is as follows. Pre-filtering: we construct $\widetilde{\mathcal{M}} \subseteq \mathcal{M}$, where all models that violate Assumption E (for Section 4.1, where we assume single-step revealing in the guarantee) or Assumption D (for Section 4.2, where we assume multi-step revealing) are excluded from $\mathcal{M}$. Such a filtering step is common in the POMDP learning literature (Liu et al., 2022a), and in Section 5 we show why MLE estimation requires this as a pre-processing step and will fail otherwise. MLE: let $\mathbb{P}^{\pi}_{M}$ stand for the probabilities under policy $\pi$ and model $M$; the model is learned as $\widehat{M} = \arg\max_{M \in \widetilde{\mathcal{M}}} \sum_{i=1}^{n} \log \mathbb{P}^{\pi_b}_{M}(\tau^{(i)}_H)$ (1). Prediction: we use the expected return of $\pi_e$ in $\widehat{M}$, $J_{\widehat{M}}(\pi_e)$, as our estimate of $J(\pi_e)$. While the paper describes the steps of an algorithm, it does not present them in a structured pseudocode block with numbered steps or code-like formatting. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical, focusing on proving information-theoretic hardness and deriving polynomial sample-complexity bounds for off-policy evaluation in POMDPs. It does not describe experiments performed on specific datasets, nor does it provide access information for any open datasets. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical studies using datasets, therefore, it does not describe any dataset splits (training, validation, or test). |
| Hardware Specification | No | The paper is theoretical and focuses on mathematical proofs and algorithms for off-policy evaluation in POMDPs. It does not describe any experimental setup that would require specific hardware, thus no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and focuses on mathematical proofs and algorithms. It does not describe any experimental implementation or software dependencies with specific version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on information-theoretic hardness and model-based algorithms for off-policy evaluation. It does not describe any empirical experiments, and thus no specific experimental setup details like hyperparameters or training configurations are provided. |
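The three-step procedure quoted in the Pseudocode row (pre-filtering, MLE over the filtered model class, then prediction under the fitted model) can be sketched in Python. This is a minimal illustrative sketch, not code from the paper: `ToyModel`, its fields, and the tabulated trajectory likelihoods are all hypothetical stand-ins for the paper's abstract model class $\mathcal{M}$ and assumption checks.

```python
import math
from dataclasses import dataclass

@dataclass
class ToyModel:
    """Illustrative stand-in for a candidate POMDP model M (hypothetical).
    `probs` maps a trajectory id to its likelihood under the behavior policy;
    `returns` maps a policy name to the model's expected return;
    `revealing` flags whether the model satisfies the revealing assumption."""
    name: str
    probs: dict
    returns: dict
    revealing: bool

def mle_ope(models, trajectories, eval_policy):
    # Step 1: pre-filtering -- exclude models violating the revealing assumption
    filtered = [M for M in models if M.revealing]
    # Step 2: MLE -- pick the model maximizing the sum of trajectory log-likelihoods
    best = max(filtered,
               key=lambda M: sum(math.log(M.probs[tau]) for tau in trajectories))
    # Step 3: prediction -- report the evaluation policy's return in the fitted model
    return best.returns[eval_policy]

# Usage: model B fits the observed trajectories best among revealing models
models = [
    ToyModel("A", {"t1": 0.1, "t2": 0.1}, {"pi_e": 0.0}, True),
    ToyModel("B", {"t1": 0.4, "t2": 0.5}, {"pi_e": 1.0}, True),
    ToyModel("C", {"t1": 0.9, "t2": 0.9}, {"pi_e": 2.0}, False),  # filtered out
]
print(mle_ope(models, ["t1", "t2"], "pi_e"))  # -> 1.0
```

The pre-filtering step matters here: model C has the highest likelihood but is discarded, matching the paper's point that MLE without the filtering pre-processing step can fail.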