Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning
Authors: Nathan Kallus, Angela Zhou
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop approximations based on nonconvex projected gradient descent and demonstrate the resulting bounds empirically. We then demonstrate the approach on a gridworld task with unobserved confounding. In Figure 5, we study the finite-sample properties of the bounds estimator, plotting $\hat{R}^e_T - \hat{R}^e_{10k}$ for differing trajectory lengths on a logarithmic grid, $T \in [250, 10000]$, and standard errors averaged over 50 replications. |
| Researcher Affiliation | Academia | Nathan Kallus, School of Operations Research and Information Engineering, Cornell University and Cornell Tech, kallus@cornell.edu; Angela Zhou, School of Operations Research and Information Engineering, Cornell University and Cornell Tech, az434@cornell.edu |
| Pseudocode | Yes | Algorithm 1: Nonconvex projected gradient descent (a minimal illustrative sketch of this technique appears after this table) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes custom-built simulation environments ('Confounded random walk', '3x3 confounded windy gridworld') but does not refer to or provide access to any publicly available datasets. |
| Dataset Splits | No | The paper describes simulation parameters like 'trajectory lengths' and 'replications' but does not specify conventional training, validation, or test dataset splits for a pre-existing dataset. |
| Hardware Specification | No | The paper mentions using 'Gurobi version 9' but does not provide any specific hardware details such as CPU, GPU models, or memory specifications used for the experiments. |
| Software Dependencies | Yes | We compute bounds via global optimization with Gurobi version 9 |
| Experiment Setup | Yes | In Fig. 3, we vary the underlying transition model, varying $p_{u_1} = p_{u_2}$ on a grid $[0.1, 0.45]$, and we plot the varying bounds with action-marginal control variates. The true underlying behavior policy takes action $a = 1$ with probability $\pi_b(1 \mid s_1, u_1) = \pi_b(1 \mid s_2, u_1) = 1/4$ (and the complementary probability when $u = u_2$). In Figure 5, we study the finite-sample properties of the bounds estimator, plotting $\hat{R}^e_T - \hat{R}^e_{10k}$ for differing trajectory lengths on a logarithmic grid, $T \in [250, 10000]$, and standard errors averaged over 50 replications. (An illustrative replication-harness sketch appears after this table.) |
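
The Pseudocode row above refers to the paper's Algorithm 1, a nonconvex projected gradient descent routine used to approximate the bounds. The snippet below is a minimal, generic sketch of projected gradient descent on a nonconvex objective over a box constraint; the toy objective, the box bounds, and the step-size schedule are illustrative assumptions, not the paper's actual estimator or constraint set.

```python
import numpy as np

def project_box(w, lo, hi):
    """Coordinate-wise projection onto the box [lo, hi]."""
    return np.clip(w, lo, hi)

def projected_gradient_descent(grad, w0, lo, hi, step=0.01, iters=500):
    """Generic projected gradient descent: take a gradient step, then
    project back onto the feasible set. For a nonconvex objective this
    only reaches a stationary point, so random restarts are common."""
    w = project_box(np.asarray(w0, dtype=float), lo, hi)
    for _ in range(iters):
        w = project_box(w - step * grad(w), lo, hi)
    return w

# Toy nonconvex objective f(w) = -sum(cos(A @ w)); its gradient is A.T @ sin(A @ w).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
grad_f = lambda w: A.T @ np.sin(A @ w)
w_hat = projected_gradient_descent(grad_f, w0=np.ones(5), lo=0.5, hi=2.0)
```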
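
The Experiment Setup row describes evaluating the bounds estimator over trajectory lengths on a logarithmic grid between 250 and 10,000, with standard errors averaged over 50 replications. Below is a minimal sketch of such a replication harness; `simulate` and `estimate_bound` are hypothetical placeholders for the data-generating process and the bounds estimator, not the paper's code.

```python
import numpy as np

def run_replications(estimate_bound, simulate, T_grid, n_reps=50, seed=0):
    """For each trajectory length T, simulate n_reps datasets, apply the
    estimator, and report the mean estimate and its standard error.

    `simulate(T, rng)` and `estimate_bound(data)` are assumed interfaces."""
    rng = np.random.default_rng(seed)
    summary = {}
    for T in T_grid:
        estimates = np.array([estimate_bound(simulate(T, rng)) for _ in range(n_reps)])
        summary[T] = (estimates.mean(), estimates.std(ddof=1) / np.sqrt(n_reps))
    return summary

# Logarithmic grid of trajectory lengths between 250 and 10,000.
T_grid = np.unique(np.geomspace(250, 10_000, num=8).astype(int))
```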