Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
Authors: Nan Jiang, Lihong Li
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the estimator's accuracy in several benchmark problems, and illustrate its use as a subroutine in safe policy improvement. (Section 6: Experiments) |
| Researcher Affiliation | Collaboration | Nan Jiang (NANJIANG@UMICH.EDU), Computer Science & Engineering, University of Michigan; Lihong Li (LIHONGLI@MICROSOFT.COM), Microsoft Research |
| Pseudocode | No | The paper defines estimators using mathematical equations (e.g., Eqn. 10) but does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. (An illustrative sketch of the DR recursion is included below the table.) |
| Open Source Code | No | The paper does not include an explicit statement about releasing its source code or provide any links to a code repository for the methodology described. |
| Open Datasets | Yes | In the last domain, we use the donation dataset from KDD Cup 1998 (Hettich & Bay, 1999) |
| Dataset Splits | Yes | We therefore split D_eval further into two subsets D_reg and D_test, estimate Q̂ from D_reg and apply DR on D_test. We partition D_eval into k subsets, apply Eqn. (8) to each subset with Q̂ estimated from the remaining data, and finally average the estimate over all subsets. We split \|D\| so that \|D_train\|/\|D\| ∈ {0.2, 0.4, 0.6, 0.8}. (A sketch of this k-fold procedure appears below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (like CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Model fitting: We use state aggregations: the two state variables are multiplied by 2^6 and 2^8 respectively, and the rounded integers are treated as the abstract state. We then estimate an MDP model from data using a tabular approach. We mix π_train and π_0 with rate α ∈ {0, 0.1, ..., 0.9}. (Sketches of the state aggregation and policy mixture appear below the table.) |
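
As the Pseudocode row notes, the paper specifies its estimators only through equations. Below is a minimal sketch of the step-wise doubly robust recursion those equations describe, assuming hypothetical interfaces for the policies and the model-based value estimates (`pi_e`, `pi_b`, `q_hat`, `v_hat` are illustrative names, not code from the paper).

```python
import numpy as np

def dr_value_estimate(episodes, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Step-wise doubly robust off-policy value estimate, averaged over episodes.

    Assumed (hypothetical) interfaces:
      episodes : list of trajectories, each a list of (state, action, reward) tuples
      pi_e(a, s), pi_b(a, s) : action probabilities under the evaluation / behavior policy
      q_hat(s, a), v_hat(s)  : approximate action-value / state-value functions
    """
    estimates = []
    for traj in episodes:
        v_dr = 0.0
        # Backward recursion: V_DR <- v_hat(s) + rho * (r + gamma * V_DR - q_hat(s, a)),
        # with per-step importance weight rho = pi_e(a|s) / pi_b(a|s).
        for s, a, r in reversed(traj):
            rho = pi_e(a, s) / pi_b(a, s)
            v_dr = v_hat(s) + rho * (r + gamma * v_dr - q_hat(s, a))
        estimates.append(v_dr)
    return float(np.mean(estimates))
```

With `q_hat` and `v_hat` identically zero this reduces to per-decision importance sampling; otherwise the model terms act as control variates that reduce variance without introducing bias when the importance weights are correct.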
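
The Dataset Splits row describes a k-fold scheme: partition D_eval into k subsets, estimate Q̂ from the data outside each subset, apply DR to that subset, and average. A sketch under the same assumptions, where `fit_q` and `dr_estimator` are hypothetical stand-ins for the paper's model fitting and its DR estimator (e.g., the sketch above with the policies already bound):

```python
import numpy as np

def dr_k_fold(episodes, k, fit_q, dr_estimator, seed=0):
    """k-fold DR evaluation: Q-hat for each fold is fit only on the remaining folds.

    Assumed helpers (not code from the paper):
      fit_q(train_episodes) -> (q_hat, v_hat)
      dr_estimator(test_episodes, q_hat, v_hat) -> float
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(episodes))
    folds = np.array_split(order, k)
    fold_estimates = []
    for fold in folds:
        held_out = set(int(i) for i in fold)
        test = [episodes[int(i)] for i in fold]
        train = [episodes[int(i)] for i in order if int(i) not in held_out]
        q_hat, v_hat = fit_q(train)  # Q-hat estimated from the remaining data only
        fold_estimates.append(dr_estimator(test, q_hat, v_hat))
    return float(np.mean(fold_estimates))
```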
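
The Experiment Setup row quotes two concrete steps: tabular state aggregation (scale the two state variables by 2^6 and 2^8 and round) and mixing the trained policy with a baseline policy at rate α. A sketch assuming a Mountain-Car-style (position, velocity) state and policies that return action-probability vectors; the variable names and the direction of the mixture are illustrative assumptions:

```python
import numpy as np

def aggregate_state(x1, x2):
    """Tabular state aggregation: scale the two continuous state variables by
    2**6 and 2**8, round to integers, and use the pair as the abstract state."""
    return (int(round(x1 * 2**6)), int(round(x2 * 2**8)))

def mix_policies(pi_train, pi_0, alpha):
    """Mixture policy with rate alpha (swept over {0, 0.1, ..., 0.9}): a convex
    combination of pi_train and pi_0, both assumed to map a state to a vector
    of action probabilities."""
    def pi_mix(state):
        p_train = np.asarray(pi_train(state), dtype=float)
        p_0 = np.asarray(pi_0(state), dtype=float)
        return alpha * p_train + (1.0 - alpha) * p_0
    return pi_mix
```

Sweeping α produces behavior policies at varying distances from the evaluation policy, which is how the quoted setup varies the difficulty of the off-policy evaluation problem.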