Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

Authors: Nan Jiang, Lihong Li

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the estimator's accuracy in several benchmark problems, and illustrate its use as a subroutine in safe policy improvement." (Section 6: Experiments)
Researcher Affiliation | Collaboration | Nan Jiang (NANJIANG@UMICH.EDU), Computer Science & Engineering, University of Michigan; Lihong Li (LIHONGLI@MICROSOFT.COM), Microsoft Research
Pseudocode | No | The paper defines estimators using mathematical equations (e.g., Eqn. 10) but does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks; a hedged sketch of the recursive DR estimator is given after the table.
Open Source Code | No | The paper does not include an explicit statement about releasing its source code or provide any links to a code repository for the methodology described.
Open Datasets | Yes | "In the last domain, we use the donation dataset from KDD Cup 1998 (Hettich & Bay, 1999)."
Dataset Splits | Yes | "We therefore split Deval further into two subsets Dreg and Dtest, estimate Q̂ from Dreg and apply DR on Dtest." "[W]e partition Deval into k subsets, apply Eqn. (8) to each subset with Q̂ estimated from the remaining data, and finally average the estimate over all subsets." "[W]e split |D| so that |Dtrain|/|D| ∈ {0.2, 0.4, 0.6, 0.8}." A cross-fitting sketch of this split appears after the table.
Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | "Model fitting: We use state aggregations: the two state variables are multiplied by 2^6 and 2^8 respectively, and the rounded integers are treated as the abstract state. We then estimate an MDP model from data using a tabular approach." "[M]ix πtrain and π0 with rate α ∈ {0, 0.1, . . . , 0.9}." Both details are sketched below.
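
For concreteness, the following is a minimal Python sketch of the doubly robust value estimate for a single trajectory, written in the recursive form of Eqn. (10): V_DR <- V̂(s_t) + ρ_t (r_t + γ V_DR - Q̂(s_t, a_t)), applied backwards from t = H. The function names, trajectory layout, and policy interfaces are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: doubly robust (DR) off-policy value estimate for one
# trajectory, following the recursive form of Eqn. (10).
def dr_estimate(trajectory, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """trajectory: list of (state, action, reward) tuples for t = 1..H.
    pi_e(a, s) / pi_b(a, s): action probabilities under the evaluation
    (target) and behavior policies.  q_hat(s, a), v_hat(s): approximate
    value functions, e.g. estimated from a separate data split."""
    v_dr = 0.0
    # Work backwards: V_DR <- V_hat(s_t) + rho_t * (r_t + gamma * V_DR - Q_hat(s_t, a_t)),
    # where rho_t = pi_e(a_t|s_t) / pi_b(a_t|s_t) is the per-step importance weight.
    for state, action, reward in reversed(trajectory):
        rho = pi_e(action, state) / pi_b(action, state)
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr
```

The overall estimate is then the average of dr_estimate over all trajectories in the evaluation set.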
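The k-fold procedure quoted under Dataset Splits can be read as a cross-fitting loop: fit Q̂ on the data outside each fold, apply the DR estimator to the fold itself, and average over folds. A minimal sketch, reusing dr_estimate above and assuming a hypothetical fit_q_hat(train_trajectories) helper that returns (q_hat, v_hat):

```python
# Hedged sketch of k-fold cross-fitting for the DR estimator; fit_q_hat and
# the data layout are assumptions for illustration.
import numpy as np

def cross_fitted_dr(trajectories, pi_e, pi_b, fit_q_hat, k=2, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(trajectories)), k)
    estimates = []
    for fold in folds:
        held_out = set(fold.tolist())
        # Q_hat is estimated only from trajectories outside the current fold.
        train = [traj for i, traj in enumerate(trajectories) if i not in held_out]
        test = [trajectories[i] for i in fold]
        q_hat, v_hat = fit_q_hat(train)
        estimates.append(np.mean([
            dr_estimate(traj, pi_e, pi_b, q_hat, v_hat, gamma) for traj in test
        ]))
    # Average the per-fold DR estimates over all k subsets.
    return float(np.mean(estimates))
```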
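The Experiment Setup row quotes two concrete recipes: state aggregation by scaling and rounding the two state variables, and building policies by mixing πtrain and π0 at rate α. A minimal sketch under stated assumptions (the 2^6 and 2^8 scale factors as recovered above, and a simple probability mixture in which α weights πtrain; the paper's exact mixing convention is not restated here):

```python
# Hedged sketch of the quoted experiment-setup details; everything beyond the
# scale-and-round aggregation and the alpha-mixing idea is an assumption.
def aggregate_state(x1, x2):
    # Multiply the two state variables by 2^6 and 2^8 and round, so the pair
    # of integers serves as the abstract (tabular) state for model fitting.
    return (round(x1 * 2 ** 6), round(x2 * 2 ** 8))

def mix_policies(pi_train, pi_0, alpha):
    # Mixture policy: with rate alpha follow pi_train, otherwise pi_0
    # (assumed convention; alpha ranges over {0, 0.1, ..., 0.9}).
    def pi(action, state):
        return alpha * pi_train(action, state) + (1.0 - alpha) * pi_0(action, state)
    return pi
```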