Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation

Authors: Nathan Kallus, Masatoshi Uehara

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now turn to an empirical study of OPE and DRL. We study the comparative performance of different OPE estimators $\hat\rho^{\pi_e}_{\mathrm{IS}}$, $\hat\rho^{\pi_e}_{\mathrm{DRL(NMDP)}}$, $\hat\rho^{\pi_e}_{\mathrm{DM}}$, $\hat\rho^{\pi_e}_{\mathrm{MIS}}$, and $\hat\rho^{\pi_e}_{\mathrm{DRL(MDP)}}$ in two standard OpenAI Gym tasks (Brockman et al., 2016): Cliff Walking and Mountain Car. [...] We report the RMSE of each estimator in each setting (and the standard error) in Tables 2 and 3.
Researcher Affiliation | Academia | ¹Cornell University, Ithaca, NY, USA; ²Harvard University, Boston, Massachusetts, USA.
Pseudocode | No | The paper describes the steps for DRL for NMDPs and MDPs in numbered lists (e.g., 'DRL for NMDPs proceeds as follows: 1. Split the data...'), but these are not formatted as pseudocode or a clearly labeled algorithm block.
Open Source Code | No | The paper does not contain any statement about releasing source code for its methodology, nor does it provide a link to a repository.
Open Datasets | Yes | We study the comparative performance of different OPE estimators $\hat\rho^{\pi_e}_{\mathrm{IS}}$, $\hat\rho^{\pi_e}_{\mathrm{DRL(NMDP)}}$, $\hat\rho^{\pi_e}_{\mathrm{DM}}$, $\hat\rho^{\pi_e}_{\mathrm{MIS}}$, and $\hat\rho^{\pi_e}_{\mathrm{DRL(MDP)}}$ in two standard OpenAI Gym tasks (Brockman et al., 2016): Cliff Walking and Mountain Car.
Dataset Splits | No | The paper mentions 'varying evaluation dataset sizes' and training on 'training data $D_j$' within the DRL algorithm description, but it does not provide specific details on how the datasets were split into training, validation, and test sets (e.g., percentages, counts, or predefined splits).
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions 'scikit-learn' for feature generation and 'OpenAI Gym' for tasks, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | First, we used Q-learning to learn an optimal policy for the MDP and define it as $\pi_d$. Then we generate the dataset from the behavior policy $\pi_b = (1-\alpha)\pi_d + \alpha\pi_u$, where $\pi_u$ is a uniform random policy and $\alpha = 0.8$. We define the target policy similarly but with $\alpha = 0.9$. [...] For Cliff Walking, we use a histogram model... For Mountain Car, we use the model $q(s, a; \beta) = \beta^{\top}\phi(s, a)$, where $\phi(s, a)$ is a 400-dimensional feature vector based on a radial basis function, generated using the RBFSampler method of scikit-learn...
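
To make the quoted experiment setup concrete, below is a minimal sketch (not the authors' code) of how the behavior and target policies could be formed as mixtures $\pi = (1-\alpha)\pi_d + \alpha\pi_u$ on the Cliff Walking task. The tabular Q-table standing in for the Q-learning output and the Gym environment id are assumptions.

```python
import numpy as np
import gym

# Assumed environment id; the paper only names the task "Cliff Walking".
env = gym.make("CliffWalking-v0")
n_states, n_actions = env.observation_space.n, env.action_space.n

# Placeholder for the Q-values learned by tabular Q-learning (the paper's pi_d
# is greedy with respect to these); zeros here purely for illustration.
q_table = np.zeros((n_states, n_actions))

def mixture_policy(alpha):
    """Return an (n_states, n_actions) matrix of action probabilities for
    pi = (1 - alpha) * pi_d + alpha * pi_u, as described in the setup."""
    pi_d = np.eye(n_actions)[q_table.argmax(axis=1)]        # deterministic greedy policy
    pi_u = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy
    return (1 - alpha) * pi_d + alpha * pi_u

pi_b = mixture_policy(alpha=0.8)  # behavior policy (alpha = 0.8)
pi_e = mixture_policy(alpha=0.9)  # target policy (alpha = 0.9)
```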
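Similarly, here is a sketch of the 400-dimensional RBF feature map and linear q-model quoted for Mountain Car, using scikit-learn's RBFSampler. The bandwidth (gamma), the random (state, action) samples used for fitting, and the concatenation of state and action are assumptions not specified in the excerpt.

```python
import numpy as np
import gym
from sklearn.kernel_approximation import RBFSampler

env = gym.make("MountainCar-v0")

# Fit the random Fourier feature map on sampled (state, action) pairs.
pairs = np.array([
    np.append(env.observation_space.sample(), env.action_space.sample())
    for _ in range(5000)
])
rbf = RBFSampler(gamma=1.0, n_components=400, random_state=0)  # 400-dim phi(s, a)
rbf.fit(pairs)

def phi(state, action):
    """Feature vector phi(s, a) of dimension 400."""
    return rbf.transform(np.append(state, action).reshape(1, -1))[0]

# Linear q-model from the setup: q(s, a; beta) = beta^T phi(s, a).
beta = np.zeros(400)  # placeholder coefficients

def q(state, action):
    return beta @ phi(state, action)
```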
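Finally, the 'Split the data...' step quoted under Pseudocode refers to sample splitting (cross-fitting): nuisances are estimated on the folds excluding $D_j$ and the estimator is averaged over $D_j$. The skeleton below illustrates only that pattern; `fit_nuisances` and `dr_term` are hypothetical placeholders for the nuisance estimation and the per-trajectory doubly robust term, which the excerpt does not spell out.

```python
import numpy as np

def cross_fit_estimate(trajectories, fit_nuisances, dr_term, K=2, seed=0):
    """Generic sample-splitting / cross-fitting skeleton.

    fit_nuisances(train_trajs) -> nuisance estimates (e.g., q-function, density ratios)
    dr_term(traj, nuisances)   -> per-trajectory doubly robust term
    """
    n = len(trajectories)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    terms = []
    for j in range(K):
        train_idx = np.setdiff1d(np.arange(n), folds[j])  # data outside fold j
        nuisances = fit_nuisances([trajectories[i] for i in train_idx])
        terms += [dr_term(trajectories[i], nuisances) for i in folds[j]]
    return float(np.mean(terms))  # average of the fold-wise evaluations
```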