Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
Authors: Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, Victor Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, Wen Sun
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide PAC guarantees to demonstrate that our method can address the curse of horizon, and conduct numerical experiments to showcase its superiority over existing methods. This section empirically evaluates the performance of the proposed method on a synthetic dataset. |
| Researcher Affiliation | Collaboration | Masatoshi Uehara (Genentech) uehara.masatoshi@gene.com; Haruka Kiyohara (Cornell University) hk844@cornell.edu; Andrew Bennett (Morgan Stanley) Andrew.Bennett@morganstanley.com; Victor Chernozhukov (MIT) vchern@mit.edu; Nan Jiang (UIUC) nanjiang@illinois.edu; Nathan Kallus (Cornell University) kallus@cornell.edu; Chengchun Shi (LSE) c.shi7@lse.ac.uk; Wen Sun (Cornell University) ws455@cornell.edu |
| Pseudocode | Yes | Algorithm 1 Minimax OPE on POMDPs (an illustrative sketch of such a minimax estimator appears after this table) |
| Open Source Code | Yes | Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope. |
| Open Datasets | Yes | We use the Cart Pole environment provided by Open AI Gym [BCP+16], which is commonly employed in other OPE studies [SUHJ22, FCG18]. |
| Dataset Splits | No | No explicit statements about specific training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined splits) were found. The paper mentions data collection details like '1000 trajectories, each containing 100 steps', but not how this data is partitioned for training, validation, and testing. |
| Hardware Specification | No | No specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments were mentioned in the paper. |
| Software Dependencies | No | The paper mentions software components like 'DDQN', 'two-layer neural networks', and 'RKHSs', and the 'Open AI Gym' environment, but does not provide specific version numbers for any of these or other key software dependencies. |
| Experiment Setup | Yes | Both our proposed method and the naive approach use two-layer neural networks for the function Q and RKHSs for Ξ, as detailed in Example 4. In contrast, our proposed method uses a 3-step history as H and a one-step future as F to address partial observability. ... we set ϵ = 0.3. Similarly, the evaluation policy is also an ϵ-greedy policy, based on the base policy obtained by BC on the observation-action pairs (O, A), with different values of ϵ ∈ {0.1, 0.3, 0.5, 0.7}. |
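
The Pseudocode row cites Algorithm 1 (Minimax OPE on POMDPs). Below is a minimal, hypothetical sketch of the kind of minimax objective such an algorithm optimizes, assuming a future-dependent value bridge `q_net(F)`, a discriminator `xi_net(H)`, and a precomputed importance weight `mu`; the function names, batch keys, and default constants are placeholders and are not taken from the authors' released code.

```python
# Illustrative sketch only: a minimax objective of the kind named by
# "Algorithm 1: Minimax OPE on POMDPs". q_net, xi_net, the batch keys,
# the weight `mu`, and the defaults below are hypothetical placeholders,
# not the authors' implementation.
import torch

def minimax_objective(q_net, xi_net, batch, gamma=0.98, lam=0.5):
    """Empirical moment for a future-dependent value bridge q(F): roughly,
    E[ mu * (R + gamma * q(F')) - q(F) | H ] = 0, enforced against a
    discriminator xi(H) with a quadratic stabilizer on xi."""
    H = batch["history"]        # k-step history (e.g., 3-step in the experiments)
    F = batch["future"]         # current future observations
    F_next = batch["next_future"]
    R = batch["reward"]
    mu = batch["weight"]        # importance weight, e.g., pi_e(A|O) / pi_b(A|O); assumed precomputed
    residual = mu * (R + gamma * q_net(F_next).squeeze(-1)) - q_net(F).squeeze(-1)
    xi = xi_net(H).squeeze(-1)
    return (xi * residual).mean() - lam * (xi ** 2).mean()

# Training would alternate ascent steps on xi_net and descent steps on q_net;
# the policy value is then estimated by averaging q over futures drawn at the
# initial time step.
```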
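
The Experiment Setup row describes ε-greedy behavior and evaluation policies built on base policies (the paper also mentions DDQN and behavior cloning). The sketch below only illustrates that construction; `epsilon_greedy_action` and its interface are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of the epsilon-greedy policies described in the Experiment
# Setup row: the behavior policy perturbs a base policy with eps = 0.3, and the
# evaluation policies perturb a behavior-cloned base policy with
# eps in {0.1, 0.3, 0.5, 0.7}. `base_action` stands in for whatever the base
# policy would choose; this interface is an assumption, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(base_action: int, n_actions: int, eps: float) -> int:
    """Follow the base policy with probability 1 - eps, otherwise act uniformly at random."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return base_action

# Cart Pole has two actions, so the behavior policy would act as, e.g.:
#   a_t = epsilon_greedy_action(base_action=base_policy_action(o_t), n_actions=2, eps=0.3)
```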