More Efficient Off-Policy Evaluation through Regularized Targeted Learning

Authors: Aurélien Bibaut, Ivana Malenica, Nikos Vlassis, Mark van der Laan

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. In this section, we demonstrate the effectiveness of RLTMLE by comparing it with other state-of-the-art methods for the OPE problem in various RL benchmark environments.
Researcher Affiliation | Collaboration | 1. University of California, Berkeley, CA; 2. Netflix, Los Gatos, CA.
Pseudocode | Yes | We present the pseudo-code of the procedure as Algorithm 1. Because of space limitations, we only give a pseudo-code description of RLTMLE 2, which is our most performant algorithm, as we will see in the next section. (An illustrative toy sketch of a targeted update follows the table.)
Open Source Code | No | The paper does not contain any statement about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets | No | The paper mentions using well-known RL benchmark environments such as 'Grid World', 'Model Fail', and 'Model Win' and states 'We implement the same behavior and evaluation policies as in previous work (Thomas & Brunskill, 2016; Farajtabar et al., 2018).', but it does not provide concrete access information for these environments (specific links, DOIs, repositories, or formal dataset citations with authors and year).
Dataset Splits | No | The paper describes an internal sample split (D(0) and D(1)) used by the algorithm itself, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for evaluating the model's overall performance on unseen data. (A minimal sample-splitting sketch follows the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or specific libraries).
Experiment Setup | Yes | In addition, we test sensitivity to the number of episodes in D with n ∈ {100, 200, 500, 1000} for Grid World and Model Fail, and n ∈ {100, 500, 1000, 5000, 10000} for Model Win. We start with a small amount of bias, b0 = 0.005 · Normal(0, 1)... Consequently, we increase model misspecification to b0 = 0.05 · Normal(0, 1). (A sketch of this sweep follows the table.)
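
As a purely illustrative aid to the Pseudocode row above, here is a toy, one-step (bandit-style) analogue of a targeted-learning update. This is not the paper's RLTMLE algorithm (whose pseudocode is Algorithm 1 in the paper); the function and variable names are assumptions for illustration only.

```python
# Toy analogue of a targeted (TMLE-style) update, NOT the paper's RLTMLE.
# An initial outcome model q_init is fluctuated by a single intercept eps,
# chosen so the importance-weighted mean residual is zero; the targeted
# estimate is the mean of the updated model. All names are illustrative.
import numpy as np

def toy_targeted_estimate(q_init, rewards, iw):
    """q_init: initial model predictions; rewards: observed rewards;
    iw: importance weights (evaluation policy / behavior policy)."""
    # Solve sum_i iw_i * (r_i - q_i - eps) = 0 for the fluctuation eps.
    eps = np.sum(iw * (rewards - q_init)) / np.sum(iw)
    return float(np.mean(q_init + eps))
```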
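For the Dataset Splits row: a minimal sketch of the kind of internal episode-level split (D(0)/D(1)) the paper describes, assuming a random partition of the logged episodes; the fold names and the 50/50 ratio are assumptions.

```python
# Minimal sketch of an episode-level sample split into two folds:
# D(0) for fitting nuisance models and D(1) for the targeting step.
# The 50/50 ratio and seed are assumptions for illustration.
import numpy as np

def split_episodes(episodes, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(episodes))
    cut = int(frac * len(episodes))
    d0 = [episodes[i] for i in idx[:cut]]  # D(0): nuisance estimation
    d1 = [episodes[i] for i in idx[cut:]]  # D(1): targeting step
    return d0, d1
```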
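For the Experiment Setup row: a hedged sketch of the reported sensitivity sweep, enumerating the quoted episode counts per environment and drawing the model-misspecification bias b0 as a scaled standard normal. The runner call is a hypothetical placeholder, not an API from the paper.

```python
# Sketch of the sensitivity sweep described in the Experiment Setup row.
# Episode grids follow the quoted text; bias b0 is drawn as scale * N(0, 1).
# run_ope_experiment is a hypothetical placeholder, not an actual API.
import numpy as np

EPISODE_GRID = {
    "Grid World": [100, 200, 500, 1000],
    "Model Fail": [100, 200, 500, 1000],
    "Model Win":  [100, 500, 1000, 5000, 10000],
}
BIAS_SCALES = (0.005, 0.05)  # small bias first, then increased misspecification

rng = np.random.default_rng(0)
for env, sizes in EPISODE_GRID.items():
    for n in sizes:
        for scale in BIAS_SCALES:
            b0 = scale * rng.standard_normal()  # bias injected into the model
            # run_ope_experiment(env, n_episodes=n, model_bias=b0)  # placeholder
```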