Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Authors: Nathan Kallus, Masatoshi Uehara

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.
Researcher Affiliation | Academia | Nathan Kallus, Cornell University, New York, NY, kallus@cornell.edu; Masatoshi Uehara, Harvard University, Cambridge, MA, uehara_m@g.harvard.edu
Pseudocode | No | The paper describes its algorithms and derivations in text and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | "We evaluate the OPE algorithms using the standard classification data-sets from the UCI repository. Here, we follow the same procedure of transforming a classification data-set into a contextual bandit data-set as in [5, 6]. ... We next compare the OPE algorithms in three standard RL settings from OpenAI Gym [3]: Windy Grid World, Cliff Walking, and Mountain Car." (A hypothetical sketch of this classification-to-bandit conversion appears after this table.)
Dataset Splits | Yes | "We first split the data into training and evaluation. ... We again split the data into training and evaluation."
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU/GPU models, memory).
Software Dependencies | No | The paper mentions methods like 'logistic regression', 'Q-learning', and 'off-policy TD learning', and refers to 'OpenAI Gym', but it does not specify version numbers for these software components or libraries.
Experiment Setup | Yes | "The resulting estimation RMSEs (root mean square error) over 200 replications of each experiment are given in Tables 2–4, where we highlight in bold the best two methods in each case. We again split the data into training and evaluation. ... We set the discounting factor to be 1.0 as in [6]."
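
For readers who want to see concretely what the quoted setup involves, below is a minimal, hypothetical Python sketch (not from the paper) of the classification-to-bandit conversion and the train/evaluation split described above. It uses scikit-learn's digits data as a stand-in for a UCI data set, a smoothed logistic-regression classifier as an assumed behavior policy, and a plain inverse-propensity (IPS) estimator in place of the paper's efficient estimators; every data set, policy, and parameter choice here is an assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical sketch of the standard classification-to-bandit recipe the
# paper cites: each class is an action; the reward is 1 if the logged action
# equals the true label and 0 otherwise; actions are logged by a known
# behavior policy whose propensities are recorded.

rng = np.random.default_rng(0)

X, y = load_digits(return_X_y=True)   # stand-in for a UCI classification data set
n_actions = len(np.unique(y))

# Split into training (to fit the policies) and evaluation (to run OPE),
# mirroring the train/evaluation split described in the paper.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Assumed behavior policy: a logistic-regression classifier smoothed toward
# uniform so that propensities stay bounded away from zero.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def behavior_probs(X):
    eps = 0.2
    return (1 - eps) * clf.predict_proba(X) + eps / n_actions

# Log one action per context and record the bandit-style reward.
p_b = behavior_probs(X_eval)
actions = np.array([rng.choice(n_actions, p=row) for row in p_b])
rewards = (actions == y_eval).astype(float)
propensities = p_b[np.arange(len(actions)), actions]

# Assumed target (evaluation) policy: the un-smoothed classifier.
p_e = clf.predict_proba(X_eval)

def ips_estimate(p_e, actions, rewards, propensities):
    """Plain inverse-propensity estimate of the target policy's value."""
    w = p_e[np.arange(len(actions)), actions] / propensities
    return np.mean(w * rewards)

# The true value is computable here because every label is observed.
true_value = np.mean(p_e[np.arange(len(y_eval)), y_eval])
est = ips_estimate(p_e, actions, rewards, propensities)
squared_error = (est - true_value) ** 2   # one replication's contribution to the RMSE
print(f"IPS estimate: {est:.4f}, true value: {true_value:.4f}")
```

Repeating the logging and estimation steps over 200 independent replications and taking the square root of the average squared error would reproduce the RMSE metric reported in the paper's Tables 2–4; the paper's own estimators would replace the IPS baseline used here for illustration.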