More Efficient Off-Policy Evaluation through Regularized Targeted Learning
Authors: Aurelien Bibaut, Ivana Malenica, Nikos Vlassis, Mark van der Laan
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. In this section, we demonstrate the effectiveness of RLTMLE by comparing it with other state-of-the-art methods for the OPE problem in various RL benchmark environments. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley, CA 2Netflix, Los Gatos, CA. |
| Pseudocode | Yes | We present the pseudo-code of the procedure as Algorithm 1. Because of space limitations, we only give a pseudo-code description of RLTMLE 2, which is our most performant algorithm, as we will see in the next section. |
| Open Source Code | No | The paper does not contain any statement about releasing source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper mentions using well-known RL benchmark environments like 'Grid World', 'Model Fail', and 'Model Win' and states 'We implement the same behavior and evaluation policies as in previous work (Thomas & Brunskill, 2016; Farajtabar et al., 2018).', but it does not provide concrete access information (specific links, DOIs, repositories, or formal citations including authors and year for the datasets themselves) to these datasets. |
| Dataset Splits | No | The paper describes an internal sample split (D(0) and D(1)) used by the algorithm itself, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for evaluating overall performance on unseen data. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or specific libraries). |
| Experiment Setup | Yes | In addition, we test sensitivity to the number of episodes in D with n = {100, 200, 500, 1000} for Grid World and Model Fail, and n = {100, 500, 1000, 5000, 10000} for Model Win. We start with a small amount of bias, b0 = 0.005 · Normal(0, 1)... Consequently, we increase model misspecification to b0 = 0.05 · Normal(0, 1). (See the configuration sketch after the table.) |
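
The quoted setup fixes only a few knobs: the episode counts per environment and the scale of the additive misspecification bias b0 · Normal(0, 1). The snippet below is a minimal sketch of how one might enumerate that grid. The environment keys, the `sample_bias` helper, and the use of NumPy are illustrative assumptions, not the authors' code; only the n values and the b0 scales come from the paper's quoted description.

```python
# Hypothetical sketch of the experiment grid quoted in the table above.
# Not the authors' code: names and structure are illustrative only.
import numpy as np

# Episode counts per benchmark environment, as quoted from the paper.
EPISODE_COUNTS = {
    "Grid World": [100, 200, 500, 1000],
    "Model Fail": [100, 200, 500, 1000],
    "Model Win": [100, 500, 1000, 5000, 10000],
}

# Model-misspecification levels: additive bias drawn as b0 * Normal(0, 1).
BIAS_SCALES = [0.005, 0.05]  # small bias first, then increased misspecification


def sample_bias(b0: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Draw additive model bias b0 * Normal(0, 1), per the quoted description."""
    return b0 * rng.standard_normal(size)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for env, episode_counts in EPISODE_COUNTS.items():
        for n in episode_counts:
            for b0 in BIAS_SCALES:
                bias = sample_bias(b0, size=n, rng=rng)
                print(f"{env}: n={n}, b0={b0}, mean |bias|={np.abs(bias).mean():.4f}")
```

A harness like this only enumerates the (environment, n, b0) combinations reported above; an actual reproduction would still need the behavior and evaluation policies referenced from Thomas & Brunskill (2016) and Farajtabar et al. (2018), which the paper does not specify in reproducible detail.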