More Robust Doubly Robust Off-policy Evaluation
Authors: Mehrdad Farajtabar, Yinlam Chow, Mohammad Ghavamzadeh
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate MRDR in bandits and RL benchmark problems, and compare its performance with the existing methods. ... In this section, we demonstrate the effectiveness of the proposed MRDR estimation by comparing it with other state-of-the-art methods from Section 3 on both contextual bandit and RL benchmark problems. |
| Researcher Affiliation | Collaboration | Mehrdad Farajtabar (Georgia Tech), Yinlam Chow (DeepMind), Mohammad Ghavamzadeh (DeepMind). Correspondence to: Yinlam Chow <yinlamchow@google.com>. |
| Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks. Method steps are described in prose. |
| Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it explicitly state that code for the described methodology is available in supplementary materials or elsewhere. |
| Open Datasets | Yes | Using the 9 benchmark experiments described in Dudík et al. (2011), we evaluate the OPE algorithms using the standard classification datasets from the UCI repository. (A hedged sketch of this classification-to-bandit conversion follows the table.) |
| Dataset Splits | No | The paper describes the number of "training trajectories" and "trajectories for sampling-based part of estimators", but it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts for distinct subsets of a dataset) needed for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud computing instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions "standard RL algorithms such as SARSA and Q-learning" but does not specify any software names with version numbers for libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | For both domains, the evaluation policy is constructed using (α, β) = (0.9, 0.05), and the behavior policy is constructed analogously using (α, β) = (0.8, 0.05). ... In the following experiments we set the discount factor to γ = 1. ... For both the Model Fail and Model Win domains, the number of training trajectories is set to 64; for the Maze, Mountain Car, and Cart Pole domains it is set to 1024. The number of trajectories for the sampling-based part of the estimators varies from 32 to 512 for the Model Win, Model Fail, and Cart Pole domains, and from 128 to 2048 for the Maze and Mountain Car domains. (A hedged configuration sketch follows the table.) |
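
The bandit experiments follow the standard supervised-to-bandit conversion of Dudík et al. (2011): each class label becomes an action and the reward is 1 if the logged action matches the true label. The sketch below is a minimal, self-contained illustration of that conversion, not the paper's code; the function names (`classification_to_bandit`, `uniform_policy`) and the uniform logging policy are assumptions for illustration, and the toy data stands in for the UCI datasets the paper uses.

```python
import numpy as np

def classification_to_bandit(X, y, behavior_policy, rng):
    """Convert a supervised dataset into logged bandit feedback:
    each class label is an action; the reward is 1 iff the logged
    action equals the true label (Dudik et al., 2011 construction)."""
    n_actions = int(np.max(y)) + 1
    logged = []
    for x, label in zip(X, y):
        probs = behavior_policy(x, n_actions)   # logging-policy distribution over actions
        a = rng.choice(n_actions, p=probs)      # sampled (logged) action
        r = float(a == label)                   # 0/1 bandit reward
        logged.append((x, a, r, probs[a]))      # keep the propensity for IPS/DR/MRDR
    return logged

def uniform_policy(x, n_actions):
    # Placeholder logging policy for this sketch; the paper's behavior
    # policies are derived differently (e.g., from a trained classifier).
    return np.full(n_actions, 1.0 / n_actions)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy feature matrix standing in for a UCI dataset
y = rng.integers(0, 3, size=100)         # toy class labels (3 actions)
data = classification_to_bandit(X, y, uniform_policy, rng)
print(data[0])
```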
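
The quoted experiment setup can be collected into a single configuration, shown in the sketch below. The `CONFIG` dictionary, the `soften` helper, and the reading of (α, β) as "α mass on the greedy action plus β uniform exploration, renormalized" are interpretations for illustration only; the paper's exact policy-construction recipe should be checked against the original text.

```python
import numpy as np

# Experiment settings as quoted in the reproducibility table above.
CONFIG = {
    "gamma": 1.0,                                   # discount factor
    "eval_policy": {"alpha": 0.9, "beta": 0.05},    # (alpha, beta) = (0.9, 0.05)
    "behavior_policy": {"alpha": 0.8, "beta": 0.05},
    "n_train_trajectories": {
        "ModelFail": 64, "ModelWin": 64,
        "Maze": 1024, "MountainCar": 1024, "CartPole": 1024,
    },
    "n_eval_trajectories": {                        # range for the sampling-based part
        "ModelWin": (32, 512), "ModelFail": (32, 512), "CartPole": (32, 512),
        "Maze": (128, 2048), "MountainCar": (128, 2048),
    },
}

def soften(greedy_action, n_actions, alpha, beta):
    """One plausible (alpha, beta)-softening of a greedy policy:
    put `alpha` mass on the greedy action, spread `beta` mass uniformly,
    then renormalize. Hypothetical reading, not verified against the paper."""
    probs = np.full(n_actions, beta / n_actions)
    probs[greedy_action] += alpha
    return probs / probs.sum()

cfg = CONFIG["eval_policy"]
print(soften(greedy_action=1, n_actions=3, alpha=cfg["alpha"], beta=cfg["beta"]))
```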