Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation

Authors: Nathan Kallus, Masatoshi Uehara

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now turn to an empirical study of OPE and DRL. We study the comparative performance of different OPE estimators $\hat\rho^{\pi_e}_{\mathrm{IS}}$, $\hat\rho^{\pi_e}_{\mathrm{DRL(NMDP)}}$, $\hat\rho^{\pi_e}_{\mathrm{DM}}$, $\hat\rho^{\pi_e}_{\mathrm{MIS}}$, and $\hat\rho^{\pi_e}_{\mathrm{DRL(MDP)}}$ in two standard OpenAI Gym tasks (Brockman et al., 2016): Cliff Walking and Mountain Car. [...] We report the RMSE of each estimator in each setting (and the standard error) in Tables 2 and 3.
Researcher Affiliation | Academia | ¹Cornell University, Ithaca, NY, USA; ²Harvard University, Boston, Massachusetts, USA.
Pseudocode | No | The paper describes the steps for DRL for NMDPs and MDPs in numbered lists (e.g., 'DRL for NMDPs proceeds as follows: 1. Split the data...'), but these are not formatted as pseudocode or a clearly labeled algorithm block.
Open Source Code | No | The paper does not contain any statement about releasing source code for its methodology, nor does it provide a link to a repository.
Open Datasets | Yes | We study the comparative performance of different OPE estimators $\hat\rho^{\pi_e}_{\mathrm{IS}}$, $\hat\rho^{\pi_e}_{\mathrm{DRL(NMDP)}}$, $\hat\rho^{\pi_e}_{\mathrm{DM}}$, $\hat\rho^{\pi_e}_{\mathrm{MIS}}$, and $\hat\rho^{\pi_e}_{\mathrm{DRL(MDP)}}$ in two standard OpenAI Gym tasks (Brockman et al., 2016): Cliff Walking and Mountain Car.
Dataset Splits | No | The paper mentions 'varying evaluation dataset sizes' and training on 'training data $D_j$' within the DRL algorithm description, but it does not provide specific details on how the datasets were split into training, validation, and test sets (e.g., percentages, counts, or predefined splits).
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions 'scikit-learn' for feature generation and 'OpenAI Gym' for tasks, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | First, we used Q-learning to learn an optimal policy for the MDP and define it as $\pi_d$. Then we generate the dataset from the behavior policy $\pi_b = (1-\alpha)\pi_d + \alpha\pi_u$, where $\pi_u$ is a uniform random policy and $\alpha = 0.8$. We define the target policy similarly but with $\alpha = 0.9$. [...] For Cliff Walking, we use a histogram model... For Mountain Car, we use the model $q(s, a; \beta) = \beta^{\top}\phi(s, a)$, where $\phi(s, a)$ is a 400-dimensional feature vector based on a radial basis function, generated using the RBFSampler method of scikit-learn...
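
To make the quoted experiment setup concrete, below is a minimal sketch (not the authors' code) of how the behavior and target policies could be formed as mixtures $\pi = (1-\alpha)\pi_d + \alpha\pi_u$ on the Cliff Walking task. The tabular Q-table standing in for the Q-learning output and the Gym environment id are assumptions.

```python
import numpy as np
import gym

# Assumed environment id; the paper only names the task "Cliff Walking".
env = gym.make("CliffWalking-v0")
n_states, n_actions = env.observation_space.n, env.action_space.n

# Placeholder for the Q-values learned by tabular Q-learning (the paper's pi_d
# is greedy with respect to these); zeros here purely for illustration.
q_table = np.zeros((n_states, n_actions))

def mixture_policy(alpha):
    """Return an (n_states, n_actions) matrix of action probabilities for
    pi = (1 - alpha) * pi_d + alpha * pi_u, as described in the setup."""
    pi_d = np.eye(n_actions)[q_table.argmax(axis=1)]        # deterministic greedy policy
    pi_u = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy
    return (1 - alpha) * pi_d + alpha * pi_u

pi_b = mixture_policy(alpha=0.8)  # behavior policy (alpha = 0.8)
pi_e = mixture_policy(alpha=0.9)  # target policy (alpha = 0.9)
```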
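Similarly, here is a sketch of the 400-dimensional RBF feature map and linear q-model quoted for Mountain Car, using scikit-learn's RBFSampler. The bandwidth (gamma), the random (state, action) samples used for fitting, and the concatenation of state and action are assumptions not specified in the excerpt.

```python
import numpy as np
import gym
from sklearn.kernel_approximation import RBFSampler

env = gym.make("MountainCar-v0")

# Fit the random Fourier feature map on sampled (state, action) pairs.
pairs = np.array([
    np.append(env.observation_space.sample(), env.action_space.sample())
    for _ in range(5000)
])
rbf = RBFSampler(gamma=1.0, n_components=400, random_state=0)  # 400-dim phi(s, a)
rbf.fit(pairs)

def phi(state, action):
    """Feature vector phi(s, a) of dimension 400."""
    return rbf.transform(np.append(state, action).reshape(1, -1))[0]

# Linear q-model from the setup: q(s, a; beta) = beta^T phi(s, a).
beta = np.zeros(400)  # placeholder coefficients

def q(state, action):
    return beta @ phi(state, action)
```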
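Finally, the 'Split the data...' step quoted under Pseudocode refers to sample splitting (cross-fitting): nuisances are estimated on the folds excluding $D_j$ and the estimator is averaged over $D_j$. The skeleton below illustrates only that pattern; `fit_nuisances` and `dr_term` are hypothetical placeholders for the nuisance estimation and the per-trajectory doubly robust term, which the excerpt does not spell out.

```python
import numpy as np

def cross_fit_estimate(trajectories, fit_nuisances, dr_term, K=2, seed=0):
    """Generic sample-splitting / cross-fitting skeleton.

    fit_nuisances(train_trajs) -> nuisance estimates (e.g., q-function, density ratios)
    dr_term(traj, nuisances)   -> per-trajectory doubly robust term
    """
    n = len(trajectories)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    terms = []
    for j in range(K):
        train_idx = np.setdiff1d(np.arange(n), folds[j])  # data outside fold j
        nuisances = fit_nuisances([trajectories[i] for i in train_idx])
        terms += [dr_term(trajectories[i], nuisances) for i in folds[j]]
    return float(np.mean(terms))  # average of the fold-wise evaluations
```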