Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
Authors: Shuze Liu, Shangtong Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7. Empirical Results: In this section, we present empirical results comparing our methods against three baselines: |
| Researcher Affiliation | Academia | Department of Computer Science, University of Virginia. Correspondence to: Shuze Liu <shuzeliu@virginia.edu>. |
| Pseudocode | Yes | Algorithm 1 Offline Data Informed (ODI) algorithm |
| Open Source Code | Yes | Our implementation is made publicly available to facilitate future research: https://github.com/ShuzeLiu/Behavior-Policy-Design-for-Policy-Evaluation |
| Open Datasets | No | The paper describes how the data was generated for the Gridworld and MuJoCo environments but does not provide a link, DOI, specific repository name, or formal citation with authors and year for a publicly available dataset. |
| Dataset Splits | No | We split the offline data into a training set and a test set. We tune all hyperparameters offline based on the supervised learning loss and fitted Q-learning loss on the test set. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory specifications) for running experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and the PPO algorithm, citing the original papers, but it does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | All hyperparameters of our methods required to learn µ̂ are tuned offline and are the same across all MuJoCo and Gridworld experiments. With the Adam optimizer (Kingma & Ba, 2015), we search the learning rates in 2^-20, 2^-18, ..., 2^0 to minimize the loss on the offline data and use the learning rate 2^-10 on all learning processes. |
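
The "Pseudocode" row above points to the paper's Algorithm 1 (Offline Data Informed, ODI). The sketch below shows only the evaluation step such a behavior-policy design feeds into: trajectories collected under a learned behavior policy µ̂ are corrected with standard per-decision importance sampling to estimate the target policy's return, which remains unbiased whenever µ̂ covers π. How µ̂ itself is learned from offline data (the core of Algorithm 1) is not reproduced here; all names and shapes in this sketch are hypothetical.

```python
# Hedged sketch: per-decision importance-sampling (PDIS) estimate of J(pi)
# from trajectories collected with a learned behavior policy mu_hat.
# This is the standard PDIS estimator, not the paper's Algorithm 1 itself.
from typing import Callable, List, Tuple

Step = Tuple[object, object, float]  # (state, action, reward)

def pdis_estimate(
    trajectories: List[List[Step]],
    pi: Callable[[object, object], float],       # target policy probability pi(a|s)
    mu_hat: Callable[[object, object], float],   # learned behavior policy probability
    gamma: float = 0.99,
) -> float:
    """Average per-decision importance-sampling return over trajectories."""
    returns = []
    for traj in trajectories:
        rho, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi(s, a) / mu_hat(s, a)       # cumulative importance ratio up to step t
            g += (gamma ** t) * rho * r          # discounted, ratio-weighted reward
        returns.append(g)
    return sum(returns) / len(returns)
```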
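
The "Dataset Splits" and "Experiment Setup" rows describe tuning hyperparameters offline: the offline data is split into train/test sets and learning rates 2^-20, 2^-18, ..., 2^0 are searched with Adam to minimize the held-out loss. The following is a minimal sketch of that search loop; the model architecture, epoch count, and synthetic data are assumptions, not details from the paper.

```python
# Hypothetical sketch of the offline learning-rate search quoted above.
import torch
import torch.nn as nn

def evaluate_lr(lr: float, train, test, epochs: int = 100) -> float:
    """Train a small regressor with Adam at the given rate; return held-out MSE."""
    x_tr, y_tr = train
    x_te, y_te = test
    model = nn.Sequential(nn.Linear(x_tr.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model(x_te), y_te).item()

if __name__ == "__main__":
    # Synthetic stand-in for the offline data; the paper uses Gridworld/MuJoCo logs.
    x, y = torch.randn(1024, 8), torch.randn(1024, 1)
    split = int(0.8 * len(x))                                # train/test split of the offline data
    train, test = (x[:split], y[:split]), (x[split:], y[split:])
    candidate_lrs = [2.0 ** k for k in range(-20, 1, 2)]     # 2^-20, 2^-18, ..., 2^0
    best_lr = min(candidate_lrs, key=lambda lr: evaluate_lr(lr, train, test))
    print(f"selected learning rate: {best_lr:.2e}")
```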