Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

Authors: Nathan Kallus, Masatoshi Uehara

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | From Section 5 (Experiments): We next conduct an experiment in a very simple environment to confirm the theoretical guarantees of the proposed estimators. More extensive experimentation remains future work. The setting is as follows. Set S_t = R, A_t = R, s_0 = 0. Then, set the transition dynamics as s_t = a_{t-1} - s_{t-1} + N(0, 0.3^2), the reward as r_t = -s_t^2, the behavior policy as π_b(a | s) = N(0.8s, 1.0^2), the deterministic evaluation policy as τ_t(s_t) = θ s_t, and the horizon as H = 20. Note that in this setting, the optimal policy is given by θ* = 1. We compare CPGK, CPGD, MPGK, and MPGD (using the Gaussian kernel) with PG. The nuisance functions q, w, d_q, d_w (and their case-K equivalents) are estimated using polynomial sieve regressions (Chen, 2007). We assume the behavior policy is known. Since q is estimated by polynomials and the kernel K is Gaussian, we can compute the integrals in MPGK and CPGK analytically. We use the same estimated q in PG. We choose h by bootstrapping the estimator for each h ∈ {0.05, 0.1, 0.25, 0.5} and choosing the value with the smallest bootstrap variance. First, in Fig. 1, we compare the MSE of the gradient estimators at θ = 1.0 over 100 replications for each of n = 200, 400, 600, 800. We find that the performance of MPGK is far superior to all other estimators in terms of MSE, which confirms our theoretical results. Interestingly, the performance of MPGD is slightly worse than that of CPGD; a possible reason is that it is more difficult to estimate w than w^K. The reasonably good performance of CPGD and CPGK can be attributed to the known λ_t^D and λ_t^K, which ensure less sensitivity to the q-estimation due to the doubly robust error structure. Second, in Fig. 2, we apply gradient ascent (see Appendix C) with α_t = 0.05, T = 50, and θ̂_1 randomly chosen from [0.8, 1.2]. We run the bootstrap only for θ̂_1 and then keep the same h for subsequent iterations. We compare the regret of the final policy for the different policy-gradient estimators, i.e., J(θ*) - J(θ̂_50), averaging over 100 replications of the experiment for each of n = 200, 400, 600, 800. Again, the performance of MPGK is superior to the other estimators in terms of regret as well. (Illustrative sketches of the simulated environment, the gradient-ascent loop, and the bandwidth selection follow the table.)
Researcher Affiliation | Academia | Nathan Kallus, Masatoshi Uehara; Cornell University and Cornell Tech, New York, NY; kallus@cornell.edu, mu223@cornell.edu
Pseudocode | Yes | A simple gradient ascent is given as an example in Appendix C and used in the experiments in the next section.
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described.
Open Datasets | No | The paper uses a simulated environment described by specific parameters (e.g., transition dynamics s_t = a_{t-1} - s_{t-1} + N(0, 0.3^2)) rather than a publicly available dataset.
Dataset Splits | No | The paper conducts experiments in a simulated environment using replications and bootstrapping for evaluation and parameter selection (e.g., 'averaging over 100 replications', 'bootstrapping the estimator'), rather than defining explicit training, validation, and test dataset splits in the conventional sense.
Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions software components such as 'polynomial sieve regressions' and 'Gaussian kernel' but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We apply gradient ascent (see Appendix C) with α_t = 0.05, T = 50, and θ̂_1 randomly chosen from [0.8, 1.2]. (See the gradient-ascent sketch after the table.)
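Below is a minimal, self-contained sketch (not the authors' code) of the toy environment as reconstructed in the Research Type row and of the plain gradient-ascent loop from Appendix C (α_t = 0.05, T = 50, θ̂_1 drawn from [0.8, 1.2]). The dynamics, reward, behavior policy, and horizon follow the quoted setup; the gradient estimator here is a simple on-policy finite-difference stand-in, not the paper's MPGK/MPGD/CPGK/CPGD estimators, and all function names are illustrative.

import numpy as np

H = 20            # horizon
SIGMA_DYN = 0.3   # transition noise standard deviation
SIGMA_B = 1.0     # behavior policy standard deviation

def rollout_behavior(rng):
    """One trajectory of (s_t, a_t, r_t) collected under the behavior policy pi_b."""
    s, traj = 0.0, []                              # s_0 = 0
    for _ in range(H):
        r = -s ** 2                                # r_t = -s_t^2
        a = rng.normal(0.8 * s, SIGMA_B)           # a_t ~ pi_b(. | s_t) = N(0.8 s_t, 1.0^2)
        traj.append((s, a, r))
        s = a - s + rng.normal(0.0, SIGMA_DYN)     # s_{t+1} = a_t - s_t + N(0, 0.3^2)
    return traj

def policy_value(theta, rng, n_rollouts=2000):
    """On-policy Monte Carlo estimate of J(theta) for tau_theta(s) = theta * s."""
    total = 0.0
    for _ in range(n_rollouts):
        s = 0.0
        for _ in range(H):
            total += -s ** 2
            s = theta * s - s + rng.normal(0.0, SIGMA_DYN)
    return total / n_rollouts

def fd_gradient(theta, rng, eps=0.05):
    """Hypothetical stand-in gradient estimator: central finite difference of J."""
    return (policy_value(theta + eps, rng) - policy_value(theta - eps, rng)) / (2 * eps)

def gradient_ascent(grad_estimator, rng, alpha=0.05, T=50):
    """Plain gradient ascent as in Appendix C: theta <- theta + alpha * grad, T times."""
    theta = rng.uniform(0.8, 1.2)                  # theta_hat_1 drawn from [0.8, 1.2]
    for _ in range(T):
        theta += alpha * grad_estimator(theta, rng)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # n = 200 logged trajectories (the input the paper's off-policy estimators would
    # consume; unused by the on-policy stand-in below).
    behavior_data = [rollout_behavior(rng) for _ in range(200)]
    theta_final = gradient_ascent(fd_gradient, rng)
    print(f"final theta: {theta_final:.3f} (optimal theta* = 1)")

Swapping fd_gradient for an off-policy estimator built from the logged behavior_data is where the paper's kernel-based and doubly robust constructions would enter.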
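The bandwidth selection quoted in the Research Type row (bootstrap the estimator for each h ∈ {0.05, 0.1, 0.25, 0.5} and keep the h with the smallest bootstrap variance) can be sketched as follows. Here grad_est(trajectories, theta, h) is a hypothetical placeholder signature for any kernel-based gradient estimator, not an API from the paper.

import numpy as np

def select_bandwidth(trajectories, theta, grad_est,
                     candidates=(0.05, 0.1, 0.25, 0.5), n_boot=100, seed=0):
    """Pick the bandwidth h whose bootstrapped gradient estimate has the smallest variance."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    best_h, best_var = None, np.inf
    for h in candidates:
        boot = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)       # resample trajectories with replacement
            boot.append(grad_est([trajectories[i] for i in idx], theta, h))
        v = np.var(boot)                           # bootstrap variance of the gradient estimate
        if v < best_var:
            best_h, best_var = h, v
    return best_h

As in the quoted experiment, this selection would be run once at θ̂_1 and the chosen h reused for the remaining gradient-ascent iterations.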