Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation

Authors: Shreyas Chaudhari, Ameet Deshpande, Bruno C. da Silva, Philip S. Thomas

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators (which include existing OPE methods as special cases) that achieve lower mean squared prediction errors. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.
Researcher Affiliation Academia Shreyas Chaudhari, University of Massachusetts (schaudhari@cs.umass.edu); Ameet Deshpande, Princeton University (asd@cs.princeton.edu); Bruno Castro da Silva, University of Massachusetts (bsilva@cs.umass.edu); Philip S. Thomas, University of Massachusetts (pthomas@cs.umass.edu)
Pseudocode Yes Algorithm 1: Overview of STAR(ϕ, c)
Open Source Code Yes The code is available at: https://github.com/shreyasc-13/STAR. Anonymized code is submitted as a .zip file with the submission. The codebase will be made public upon acceptance.
Open Datasets Yes ICU-Sepsis is built from real-world medical records obtained from the MIMIC-III dataset [24]. Reference [24]: MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1-9, 2016.
Dataset Splits No Estimator selection presents a significant challenge for OPE [48] due to the unavailability of a validation set.
Hardware Specification Yes The experiments were run using 32 threads on Xeon E5-2680 CPUs on a computing cluster, bringing the total compute time to roughly 45000 compute hours.
Software Dependencies No The paper does not specify software dependencies with version numbers. It mentions "OpenAI Gym" and the "MinAtar testbed" but gives no version for either, so the software environment cannot be replicated exactly.
Experiment Setup Yes For the class of abstraction function, we observe that the simple method CluSTAR performs well across all domains, and hence we use it for all experiments. CluSTAR takes as input a single hyperparameter, the number of centroids initialized, denoted by |Z|. We evaluate the following configurations of |Z| and c for each domain: 1. CartPole: 35 estimators, |Z| ∈ {2, 4, 8, 16, 32, 64, 128}, c ∈ {1, 2, 3, 4, 5}. 2. ICU-Sepsis: 25 estimators, |Z| ∈ {2, 4, 8, 16, 32}, c ∈ {1, 2, 3, 4, 5}. 3. Asterix: 25 estimators, |Z| ∈ {2, 4, 8, 16, 32}, c ∈ {1, 2, 3, 4, 5}.
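The estimator counts quoted in the Experiment Setup row follow from taking every combination of |Z| and c in each domain. The sketch below only enumerates those grids to check the arithmetic. The names `GRIDS`, `num_abstract_states`, and `enumerate_configs` are illustrative assumptions and are not taken from the paper or its codebase.

```python
from itertools import product

# Hyperparameter grids as quoted in the Experiment Setup row.
# The dictionary layout and key names are hypothetical, chosen for illustration.
GRIDS = {
    "CartPole": {
        "num_abstract_states": [2, 4, 8, 16, 32, 64, 128],  # |Z|
        "c": [1, 2, 3, 4, 5],
    },
    "ICU-Sepsis": {
        "num_abstract_states": [2, 4, 8, 16, 32],
        "c": [1, 2, 3, 4, 5],
    },
    "Asterix": {
        "num_abstract_states": [2, 4, 8, 16, 32],
        "c": [1, 2, 3, 4, 5],
    },
}


def enumerate_configs(grid):
    """Return every (|Z|, c) pair in the grid; each pair identifies one estimator."""
    return list(product(grid["num_abstract_states"], grid["c"]))


# Number of estimator configurations per domain.
config_counts = {domain: len(enumerate_configs(grid)) for domain, grid in GRIDS.items()}

for domain, count in config_counts.items():
    print(f"{domain}: {count} estimators")
```

Under these grids the counts are 7 × 5 = 35 for CartPole and 5 × 5 = 25 each for ICU-Sepsis and Asterix, matching the numbers quoted in the row.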