Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Authors: Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, Victor Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, Wen Sun

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide PAC guarantees to demonstrate that our method can address the curse of horizon, and conduct numerical experiments to showcase its superiority over existing methods. This section empirically evaluates the performance of the proposed method on a synthetic dataset.
Researcher Affiliation | Collaboration | Masatoshi Uehara, Genentech, uehara.masatoshi@gene.com; Haruka Kiyohara, Cornell University, hk844@cornell.edu; Andrew Bennett, Morgan Stanley, Andrew.Bennett@morganstanley.com; Victor Chernozhukov, MIT, vchern@mit.edu; Nan Jiang, UIUC, nanjiang@illinois.edu; Nathan Kallus, Cornell University, kallus@cornell.edu; Chengchun Shi, LSE, c.shi7@lse.ac.uk; Wen Sun, Cornell University, ws455@cornell.edu
Pseudocode | Yes | Algorithm 1: Minimax OPE on POMDPs (a hedged sketch of the minimax objective appears below the table)
Open Source Code | Yes | Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope.
Open Datasets | Yes | We use the Cart Pole environment provided by OpenAI Gym [BCP+16], which is commonly employed in other OPE studies [SUHJ22, FCG18]. (A data-collection sketch appears below the table.)
Dataset Splits | No | No explicit statements about specific training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined splits) were found. The paper mentions data-collection details such as '1000 trajectories, each containing 100 steps', but not how this data is partitioned for training, validation, and testing.
Hardware Specification | No | No specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments were mentioned in the paper.
Software Dependencies | No | The paper mentions software components such as 'DDQN', 'two-layer neural networks', 'RKHSs', and the 'OpenAI Gym' environment, but does not provide version numbers for these or for other key software dependencies.
Experiment Setup | Yes | Both our proposed method and the naive approach use two-layer neural networks for the function Q and RKHSs for Ξ, as detailed in Example 4. In contrast, our proposed method uses a 3-step history as H and a one-step future as F to address partial observability. ... we set ϵ = 0.3. Similarly, the evaluation policy is also an ϵ-greedy policy, based on the base policy obtained by BC on the observation-action pairs (O, A), with different values of ϵ ∈ {0.1, 0.3, 0.5, 0.7}. (An ϵ-greedy sketch appears below the table.)
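
The Experiment Setup row describes ϵ-greedy behavior and evaluation policies built on top of base policies (the evaluation base policy obtained by behavior cloning on observation-action pairs). Below is a minimal sketch of such a wrapper, assuming a discrete-action base policy that maps an observation to a greedy action; the class name, method names, and seeding are illustrative and not taken from the authors' code.

import numpy as np

class EpsilonGreedyPolicy:
    """Illustrative epsilon-greedy wrapper around a greedy base policy."""

    def __init__(self, base_policy, epsilon, n_actions=2, seed=0):
        self.base_policy = base_policy   # callable: observation -> greedy action
        self.epsilon = epsilon           # e.g. 0.3 for the behavior policy
        self.n_actions = n_actions       # CartPole has two discrete actions
        self.rng = np.random.default_rng(seed)

    def sample(self, obs):
        # With probability epsilon act uniformly at random, otherwise act greedily.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return self.base_policy(obs)

    def prob(self, obs, action):
        # Action probability, needed for the importance weight pi_e(a | o) / pi_b(a | o).
        uniform = self.epsilon / self.n_actions
        if action == self.base_policy(obs):
            return uniform + (1.0 - self.epsilon)
        return uniform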
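
The Open Datasets row points to the CartPole environment, and the Dataset Splits row quotes the collection of 1000 trajectories of 100 steps each. A rollout loop consistent with those numbers might look like the sketch below; it assumes the classic Gym step API (obs, reward, done, info), and the function name, environment version, and handling of early termination are assumptions rather than the authors' implementation.

import gym

def collect_trajectories(behavior_policy, n_traj=1000, horizon=100):
    """Roll out a behavior policy on CartPole, recording (obs, action, reward, next_obs) per step."""
    env = gym.make("CartPole-v0")
    dataset = []
    for _ in range(n_traj):
        obs = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = behavior_policy.sample(obs)
            next_obs, reward, done, _ = env.step(action)
            trajectory.append((obs, action, reward, next_obs))
            obs = next_obs
            if done:
                obs = env.reset()  # assumption: restart so every trajectory has exactly 100 steps
        dataset.append(trajectory)
    return dataset

With the wrapper above, behavior data could then be gathered as collect_trajectories(EpsilonGreedyPolicy(base_policy, epsilon=0.3)), where base_policy is the assumed greedy base policy.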
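
The Pseudocode row names Algorithm 1, Minimax OPE on POMDPs, and the Experiment Setup row states that the value-function class Q is a two-layer neural network while the test-function class Ξ is an RKHS over histories H. The sketch below shows one common way such a minimax objective is instantiated when the critic class is an RKHS: maximizing the squared empirical moment over the unit RKHS ball has a closed form, leaving a kernel-weighted quadratic in the moment residual to minimize over the network. The exact residual, the Gaussian kernel, the layer sizes, and the discount value are assumptions here, not the authors' implementation.

import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Two-layer network for the future-dependent value function q(F) (sizes assumed)."""
    def __init__(self, dim_f, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_f, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, f):
        return self.net(f).squeeze(-1)

def gaussian_gram(h, bandwidth=1.0):
    # Gram matrix of a Gaussian kernel over histories H (kernel and bandwidth are assumptions).
    return torch.exp(-torch.cdist(h, h) ** 2 / (2.0 * bandwidth ** 2))

def minimax_loss(q, batch, gamma):
    # Assumed moment residual: importance-weighted one-step residual of q, conditioned on H.
    f, f_next, h, r, rho = batch          # futures, next futures, histories, rewards, pi_e/pi_b weights
    delta = rho * (r + gamma * q(f_next)) - q(f)
    K = gaussian_gram(h)
    n = delta.shape[0]
    # Maximizing the squared empirical moment over the unit RKHS ball yields this quadratic form.
    return delta @ K @ delta / (n ** 2)

# Outer minimization over q (data loading and the final policy-value readout omitted):
# q_net = TwoLayerNet(dim_f=future_dim)   # future_dim set by the chosen one-step future F
# opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
# for batch in loader:
#     opt.zero_grad(); minimax_loss(q_net, batch, gamma=0.98).backward(); opt.step()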