Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning
Authors: Nathan Kallus, Masatoshi Uehara
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages. |
| Researcher Affiliation | Academia | Nathan Kallus, Cornell University, New York, NY (kallus@cornell.edu); Masatoshi Uehara, Harvard University, Cambridge, MA (uehara_m@g.harvard.edu) |
| Pseudocode | No | The paper describes algorithms and derivations in text and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate the OPE algorithms using the standard classification datasets from the UCI repository. Here, we follow the same procedure of transforming a classification dataset into a contextual bandit dataset as in [5, 6]. ... We next compare the OPE algorithms in three standard RL settings from OpenAI Gym [3]: Windy Grid World, Cliff Walking, and Mountain Car. (A sketch of the classification-to-bandit conversion is given below the table.) |
| Dataset Splits | Yes | We first split the data into training and evaluation. ... We again split the data into training and evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU/GPU models, memory). |
| Software Dependencies | No | The paper mentions methods like 'logistic regression', 'Q-learning', and 'off-policy TD learning', and refers to 'Open AI Gym', but it does not specify any version numbers for these software components or libraries. |
| Experiment Setup | Yes | The resulting estimation RMSEs (root mean square errors) over 200 replications of each experiment are given in Tables 2–4, where we highlight in bold the best two methods in each case. We again split the data into training and evaluation. ... We set the discounting factor to be 1.0 as in [6]. (The RMSE computation is sketched below the table.) |
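
The classification-to-bandit conversion referenced in the Open Datasets row follows the standard recipe from the cited prior work: contexts are the feature vectors, actions are the class labels, and the reward is 1 exactly when the logged action matches the true label. The sketch below illustrates this under stated assumptions; the function name `to_contextual_bandit` and the logging-policy matrix `behavior_probs` are illustrative placeholders, not names from the paper or its code.

```python
import numpy as np

def to_contextual_bandit(X, y, n_actions, behavior_probs, seed=None):
    """Turn a classification dataset into logged bandit feedback.

    X: (n, d) contexts; y: (n,) true labels in {0, ..., n_actions - 1};
    behavior_probs: (n, n_actions) action probabilities of the logging policy.
    Returns contexts, logged actions, observed rewards, and logging propensities.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Sample one action per context from the logging (behavior) policy.
    actions = np.array(
        [rng.choice(n_actions, p=behavior_probs[i]) for i in range(n)]
    )
    # Only the reward of the logged action is observed: 1 if it matches the
    # true label, 0 otherwise, mimicking partial bandit feedback.
    rewards = (actions == y).astype(float)
    propensities = behavior_probs[np.arange(n), actions]
    return X, actions, rewards, propensities
```

The logged tuples (context, action, reward, propensity) can then be fed to the OPE estimators compared in the paper, while the held-out full-label data provides the ground-truth value of the target policy.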
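
The RMSE metric in the Experiment Setup row is the standard root mean squared error of per-replication estimates against the true policy value. The snippet below is a minimal sketch of that definition, not the authors' evaluation script; the variable names and the synthetic example numbers are placeholders.

```python
import numpy as np

def rmse(estimates, true_value):
    """Root mean squared error of per-replication OPE estimates."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))

# Example with synthetic numbers (200 replications, matching the paper's protocol).
dummy_estimates = np.random.default_rng(0).normal(loc=1.0, scale=0.1, size=200)
print(rmse(dummy_estimates, true_value=1.0))
```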