Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Authors: Shuze Liu, Shangtong Zhang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "7. Empirical Results: In this section, we present empirical results comparing our methods against three baselines..."
Researcher Affiliation | Academia | "Department of Computer Science, University of Virginia. Correspondence to: Shuze Liu <shuzeliu@virginia.edu>."
Pseudocode | Yes | "Algorithm 1: Offline Data Informed (ODI) algorithm"
Open Source Code | Yes | "Our implementation is made publicly available to facilitate future research: https://github.com/ShuzeLiu/Behavior-Policy-Design-for-Policy-Evaluation"
Open Datasets | No | The paper describes how the data was generated for the Gridworld and MuJoCo environments but does not provide a link, DOI, specific repository name, or formal citation (authors and year) for a publicly available dataset.
Dataset Splits | No | "We split the offline data into a training set and a test set. We tune all hyperparameters offline based on the supervised learning loss and fitted Q-learning loss on the test set."
Hardware Specification | No | The paper does not explicitly describe the hardware used to run the experiments (e.g., GPU/CPU models or memory specifications).
Software Dependencies | No | The paper mentions using the Adam optimizer and the PPO algorithm, with references to other papers, but it does not specify version numbers for any software dependencies (programming languages, libraries, or frameworks) used in the implementation.
Experiment Setup | Yes | "All hyperparameters of our methods required to learn μ̂ are tuned offline and are the same across all MuJoCo and Gridworld experiments. With the Adam optimizer (Kingma & Ba, 2015), we search the learning rates in {2^-20, 2^-18, ..., 2^0} to minimize the loss on the offline data and use the learning rate 2^-10 on all learning processes." (This tuning protocol is sketched below.)
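
The Dataset Splits and Experiment Setup rows describe a single offline tuning protocol: split the logged data into a training set and a test set, sweep Adam learning rates 2^-20, 2^-18, ..., 2^0, and keep the rate that minimizes the held-out loss. Below is a minimal sketch of that protocol, not the authors' code; the model factory, loss function, and data iterables (make_model, loss_fn, train_set, test_set) are hypothetical placeholders, and a PyTorch-style setup is assumed.

```python
# Hedged sketch of the offline learning-rate sweep quoted above. Assumptions
# (not taken from the paper's repository): PyTorch, (x, y) minibatch iterables,
# a model factory, and a pointwise loss function.
import torch


def tune_learning_rate(make_model, loss_fn, train_set, test_set, epochs=10):
    """Sweep Adam learning rates 2^-20, 2^-18, ..., 2^0 offline and return the
    rate (with its trained model) that minimizes the loss on the test split."""
    best_lr, best_loss, best_model = None, float("inf"), None
    for exponent in range(-20, 1, 2):          # 2^-20, 2^-18, ..., 2^0
        lr = 2.0 ** exponent
        model = make_model()                   # fresh model for each candidate rate
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in train_set:             # offline training split
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
        with torch.no_grad():                  # held-out offline test split
            test_loss = sum(loss_fn(model(x), y).item() for x, y in test_set)
        if test_loss < best_loss:
            best_lr, best_loss, best_model = lr, test_loss, model
    return best_lr, best_model
```

In the paper's setting, the swept loss corresponds to the supervised learning loss and fitted Q-learning loss used to learn μ̂, and the selected rate (2^-10 in the quote) is then reused across all Gridworld and MuJoCo experiments.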