More Robust Doubly Robust Off-policy Evaluation
Authors: Mehrdad Farajtabar, Yinlam Chow, Mohammad Ghavamzadeh
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate MRDR in bandits and RL benchmark problems, and compare its performance with the existing methods. ... In this section, we demonstrate the effectiveness of the proposed MRDR estimation by comparing it with other state-of-the-art methods from Section 3 on both contextual bandit and RL benchmark problems. |
| Researcher Affiliation | Collaboration | Mehrdad Farajtabar (Georgia Tech), Yinlam Chow (DeepMind), Mohammad Ghavamzadeh (DeepMind). Correspondence to: Yinlam Chow <yinlamchow@google.com>. |
| Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks. Method steps are described in prose. |
| Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it explicitly state that code for the described methodology is available in supplementary materials or elsewhere. |
| Open Datasets | Yes | Using the 9 benchmark experiments described in Dudík et al. (2011), we evaluate the OPE algorithms using the standard classification datasets from the UCI repository. (A hedged sketch of this classification-to-bandit conversion follows the table.) |
| Dataset Splits | No | The paper describes the number of "training trajectories" and "trajectories for sampling-based part of estimators", but it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts for distinct subsets of a dataset) needed for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud computing instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions "standard RL algorithms such as SARSA and Q-learning" but does not specify any software names with version numbers for libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | For both domains, the evaluation policy is constructed using (α, β) = (0.9, 0.05), and the behavior policy is constructed analogously using (α, β) = (0.8, 0.05). ... In the following experiments we set the discount factor to γ = 1. ... For both the Model Fail and Model Win domains, the number of training trajectories is set to 64; for the Maze, Mountain Car, and Cart Pole domains it is set to 1024. The number of trajectories for the sampling-based part of the estimators varies from 32 to 512 for the Model Win, Model Fail, and Cart Pole domains, and from 128 to 2048 for the Maze and Mountain Car domains. (A hedged configuration sketch follows the table.) |
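
The bandit experiments follow the standard supervised-to-bandit conversion of Dudík et al. (2011): each class label becomes an action and the reward is 1 if the logged action matches the true label. The sketch below is a minimal, self-contained illustration of that conversion, not the paper's code; the function names (`classification_to_bandit`, `uniform_policy`) and the uniform logging policy are assumptions for illustration, and the toy data stands in for the UCI datasets the paper uses.

```python
import numpy as np

def classification_to_bandit(X, y, behavior_policy, rng):
    """Convert a supervised dataset into logged bandit feedback:
    each class label is an action; the reward is 1 iff the logged
    action equals the true label (Dudik et al., 2011 construction)."""
    n_actions = int(np.max(y)) + 1
    logged = []
    for x, label in zip(X, y):
        probs = behavior_policy(x, n_actions)   # logging-policy distribution over actions
        a = rng.choice(n_actions, p=probs)      # sampled (logged) action
        r = float(a == label)                   # 0/1 bandit reward
        logged.append((x, a, r, probs[a]))      # keep the propensity for IPS/DR/MRDR
    return logged

def uniform_policy(x, n_actions):
    # Placeholder logging policy for this sketch; the paper's behavior
    # policies are derived differently (e.g., from a trained classifier).
    return np.full(n_actions, 1.0 / n_actions)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy feature matrix standing in for a UCI dataset
y = rng.integers(0, 3, size=100)         # toy class labels (3 actions)
data = classification_to_bandit(X, y, uniform_policy, rng)
print(data[0])
```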
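
The quoted experiment setup can be collected into a single configuration, shown in the sketch below. The `CONFIG` dictionary, the `soften` helper, and the reading of (α, β) as "α mass on the greedy action plus β uniform exploration, renormalized" are interpretations for illustration only; the paper's exact policy-construction recipe should be checked against the original text.

```python
import numpy as np

# Experiment settings as quoted in the reproducibility table above.
CONFIG = {
    "gamma": 1.0,                                   # discount factor
    "eval_policy": {"alpha": 0.9, "beta": 0.05},    # (alpha, beta) = (0.9, 0.05)
    "behavior_policy": {"alpha": 0.8, "beta": 0.05},
    "n_train_trajectories": {
        "ModelFail": 64, "ModelWin": 64,
        "Maze": 1024, "MountainCar": 1024, "CartPole": 1024,
    },
    "n_eval_trajectories": {                        # range for the sampling-based part
        "ModelWin": (32, 512), "ModelFail": (32, 512), "CartPole": (32, 512),
        "Maze": (128, 2048), "MountainCar": (128, 2048),
    },
}

def soften(greedy_action, n_actions, alpha, beta):
    """One plausible (alpha, beta)-softening of a greedy policy:
    put `alpha` mass on the greedy action, spread `beta` mass uniformly,
    then renormalize. Hypothetical reading, not verified against the paper."""
    probs = np.full(n_actions, beta / n_actions)
    probs[greedy_action] += alpha
    return probs / probs.sum()

cfg = CONFIG["eval_policy"]
print(soften(greedy_action=1, n_actions=3, alpha=cfg["alpha"], beta=cfg["beta"]))
```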