Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Authors: Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, Victor Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, Wen Sun

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide PAC guarantees to demonstrate that our method can address the curse of horizon, and conduct numerical experiments to showcase its superiority over existing methods." "This section empirically evaluates the performance of the proposed method on a synthetic dataset."
Researcher Affiliation | Collaboration | Masatoshi Uehara (Genentech), Haruka Kiyohara (Cornell University), Andrew Bennett (Morgan Stanley), Victor Chernozhukov (MIT), Nan Jiang (UIUC), Nathan Kallus (Cornell University), Chengchun Shi (LSE), Wen Sun (Cornell University)
Pseudocode | Yes | "Algorithm 1: Minimax OPE on POMDPs"
Open Source Code | Yes | "Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope."
Open Datasets | Yes | "We use the Cart Pole environment provided by Open AI Gym [BCP+16], which is commonly employed in other OPE studies [SUHJ22, FCG18]."
Dataset Splits | No | No explicit statements about specific training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined splits) were found. The paper mentions data-collection details such as "1000 trajectories, each containing 100 steps", but not how this data is partitioned for training, validation, and testing.
Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, or memory amounts used for the experiments) are mentioned in the paper.
Software Dependencies | No | The paper mentions software components such as DDQN, two-layer neural networks, RKHSs, and the OpenAI Gym environment, but does not provide version numbers for these or any other key software dependencies.
Experiment Setup | Yes | "Both our proposed method and the naive approach use two-layer neural networks for the function Q and RKHSs for Ξ, as detailed in Example 4. In contrast, our proposed method uses a 3-step history as H and a one-step future as F to address partial observability. ... we set ϵ = 0.3. Similarly, the evaluation policy is also an ϵ-greedy policy, based on the base policy obtained by BC on the observation-action pairs (O, A), with different values of ϵ ∈ {0.1, 0.3, 0.5, 0.7}."
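
The rows above pin down the paper's data-collection recipe: the CartPole environment from OpenAI Gym, 1000 logged trajectories of 100 steps each, and an ϵ-greedy behavior policy with ϵ = 0.3. The three sketches below illustrate that recipe end to end (collection, history/future construction, and the minimax objective); none is taken from the authors' repository, so every function name and default value is an assumption. First, a minimal collection loop, assuming the Gym ≥ 0.26 step API and a hypothetical `base_policy` standing in for the paper's DDQN/BC-derived base policy:

```python
# Hypothetical sketch: logging CartPole data with an epsilon-greedy behavior
# policy (1000 trajectories x 100 steps, epsilon = 0.3, per the paper).
import numpy as np
import gym

def epsilon_greedy(base_action, n_actions, eps, rng):
    """With probability eps take a uniform random action, else the base action."""
    return int(rng.integers(n_actions)) if rng.random() < eps else base_action

def base_policy(obs):
    """Hypothetical stand-in for the paper's base policy (obtained via DDQN/BC)."""
    return int(obs[2] > 0)  # push toward the side the pole is leaning

def collect(n_traj=1000, horizon=100, eps=0.3, seed=0):
    rng = np.random.default_rng(seed)
    env = gym.make("CartPole-v1")
    dataset = []
    for _ in range(n_traj):
        obs, _ = env.reset(seed=int(rng.integers(2**31)))
        traj = []
        for _ in range(horizon):
            a = epsilon_greedy(base_policy(obs), env.action_space.n, eps, rng)
            next_obs, r, terminated, truncated, _ = env.step(a)
            traj.append((obs, a, r, next_obs))
            obs = next_obs
            if terminated or truncated:
                break
        dataset.append(traj)
    return dataset
```

Note that CartPole episodes can terminate before 100 steps when the pole falls; since the paper reports fixed-length trajectories, the released code presumably handles termination differently (e.g., by resetting or disabling it), which is worth checking against the repository.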
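Second, the Experiment Setup row specifies a 3-step history H and a one-step future F to address partial observability. The exact definitions of H and F are given in the paper; purely as an illustration, one plausible construction slices each logged trajectory into overlapping history/future windows:

```python
# Illustrative only: building 3-step history (H) and one-step future (F)
# features from a logged trajectory of (obs, action, reward, next_obs) tuples.
import numpy as np

def history_future_pairs(traj, h_len=3):
    """Return (H_t, F_t, F_{t+1}, r_t) tuples, where H_t stacks the previous
    h_len observation-action pairs and F_t is the current observation."""
    pairs = []
    for t in range(h_len, len(traj) - 1):
        hist = np.concatenate(
            [np.append(traj[k][0], traj[k][1]) for k in range(t - h_len, t)]
        )
        f_t = traj[t][0]         # one-step future: the current observation
        f_next = traj[t + 1][0]  # the future at the next time step
        r_t = traj[t][2]
        pairs.append((hist, f_t, f_next, r_t))
    return pairs
```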
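Third, the Pseudocode row points to Algorithm 1 (minimax OPE), and the setup row states that Q is a two-layer neural network while the critic class Ξ is an RKHS. When the critic ranges over the unit ball of an RKHS with kernel k, the inner maximization of the squared moment E[ξ(H)·δ]² has the standard closed form (1/n²) Σᵢⱼ δᵢ δⱼ k(Hᵢ, Hⱼ), so only the outer minimization over Q needs gradient steps. The residual δ = w·(R + γ·q(F′)) − q(F) below, with w a policy density ratio, is an assumed generic form of a future-dependent moment condition; the paper's Algorithm 1 may define it differently.

```python
# Sketch of a kernelized minimax objective: two-layer network for Q and an
# RKHS (Gaussian-kernel) critic class for Xi, per the experiment setup row.
# The residual form is an assumption, not the paper's exact Algorithm 1.
import torch
import torch.nn as nn

class TwoLayerQ(nn.Module):
    """Two-layer network mapping a one-step future F to a scalar value."""
    def __init__(self, f_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(f_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, f):
        return self.net(f).squeeze(-1)

def gaussian_gram(h, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||H_i - H_j||^2 / (2 * bandwidth^2))."""
    sq_dists = torch.cdist(h, h).pow(2)
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def minimax_loss(q, h, f, f_next, r, w, gamma=0.98):
    """Closed-form inner max over the RKHS unit ball of E[xi(H) * delta]^2:
    (1/n^2) * delta^T K delta, with K the Gram matrix over histories."""
    delta = w * (r + gamma * q(f_next)) - q(f)  # assumed residual form
    K = gaussian_gram(h)
    n = delta.shape[0]
    return delta @ K @ delta / (n * n)
```

Training would then minimize `minimax_loss` over the parameters of `q` with a standard optimizer; the kernel bandwidth, discount γ, and ratio w are placeholders to be filled in from the actual experimental configuration.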