Off-Policy Evaluation in Partially Observable Environments
Authors: Guy Tennenholtz, Uri Shalit, Shie Mannor | Pages: 10276-10283
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data. In Section 6, we experiment with the results of Theorem 2 and the IS variant constructed in this section on a finite-sample dataset generated by a synthetic medical environment. In this work we experimented with a tabular environment. |
| Researcher Affiliation | Academia | Guy Tennenholtz Technion Institute of Technology Shie Mannor Technion Institute of Technology Uri Shalit Technion Institute of Technology |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | No | The paper states: 'In our experiments we construct a synthetic medical environment...' and 'Ten million trajectories were sampled from the policy πb over a horizon of 4 time steps for each environment.' No concrete access information (link, DOI, repository name, formal citation) is provided for this synthetic dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology). |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We denote σ(x) = 1/(1+e^(−x)). The environment consists of a patient's (observed) medical state z. ... The observed state space Z, unobserved state space U, and observation space O were composed of two binary features each. We run the experiment in three environments, corresponding to different settings of the vectors c, meant to illustrate different behaviors of our methods. Ten million trajectories were sampled from the policy πb over a horizon of 4 time steps for each environment. Figure 3 depicts the cumulative reward of πe, πb, and their corresponding estimates according to Theorem 2 and the IS weights w_i^k = πe^(i)(a_i \| h_i^o) / πb(a_i \| h_i^o), for different values of α. |
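The paper released no code, but the importance-sampling estimator referenced in the Experiment Setup row can be sketched in a few lines. The snippet below is a hypothetical illustration of ordinary trajectory-wise IS, not the authors' implementation: it reweights each trajectory's return by the product of per-step ratios πe(a|h)/πb(a|b), here simplified so that each policy conditions only on the current observation (the `pi_e`/`pi_b` callables and the trajectory format are assumptions for this sketch).

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b):
    """Ordinary importance-sampling estimate of the evaluation policy's
    expected cumulative reward.

    trajectories: list of trajectories, each a list of
        (observation, action, reward) tuples sampled under pi_b.
    pi_e, pi_b: callables (action, observation) -> probability under the
        evaluation and behavior policies (hypothetical interface).
    """
    values = []
    for traj in trajectories:
        weight = 1.0   # cumulative product of per-step likelihood ratios
        ret = 0.0      # undiscounted return of this trajectory
        for obs, action, reward in traj:
            weight *= pi_e(action, obs) / pi_b(action, obs)
            ret += reward
        values.append(weight * ret)
    # Average of weighted returns over all sampled trajectories
    return float(np.mean(values))

# Toy usage: two 2-step trajectories with binary actions.
trajs = [
    [(0, 0, 1.0), (0, 1, 0.0)],
    [(0, 1, 1.0), (0, 0, 1.0)],
]
uniform = lambda a, o: 0.5
print(is_estimate(trajs, uniform, uniform))  # -> 1.5 (plain mean return)
```

When πe = πb every weight is 1 and the estimate reduces to the empirical mean return; in a POMDP, however, conditioning the weights on observation histories rather than latent states is exactly the pitfall the paper analyzes, so this naive estimator can be biased even with infinitely many trajectories.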