Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
Authors: Yihao Feng, Ziyang Tang, Na Zhang, Qiang Liu
ICLR 2021 | Conference PDF | Archive PDF
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results that clearly demonstrate the advantages of our approach over existing methods. (...) We present our main approach in Section 4 and perform empirical studies in Section 5. |
| Researcher Affiliation | Academia | Yihao Feng*, Ziyang Tang (University of Texas at Austin, {yihao, ztang}@cs.utexas.edu); Na Zhang (Tsinghua University, zhangna@pbcsf.tsinghua.edu.cn); Qiang Liu (University of Texas at Austin, lqiang@cs.utexas.edu) |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that its source code is publicly available. |
| Open Datasets | Yes | We test our method on three environments: Inverted Pendulum and Cart Pole from OpenAI Gym (Brockman et al., 2016), and a Type-1 Diabetes medical treatment simulator (https://github.com/jxx123/simglucose). A hedged environment-loading sketch follows the table. |
| Dataset Splits | Yes | The bandwidths of k and k̄ are selected to make sure the Bellman loss function is not large on a validation set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper mentions using 'PPO (Schulman et al., 2017)' as a policy training method but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or specific library versions). |
| Experiment Setup | Yes | For horizon lengths, we fix γ = 0.95 and set horizon length H = 50 for Inverted Pendulum, H = 100 for Cart Pole, and H = 50 for the Diabetes simulator. (...) We take both kernels to be Gaussian RBF kernels and choose r_Q and the bandwidths of k and k̄ using the procedure in Appendix H.2. We use a fast approximation method to optimize ω in F^+_Q(ω) and F^−_Q(ω), as shown in Appendix D. A hedged configuration sketch also follows the table. |
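
The Open Datasets row quotes the three evaluation environments. As a rough, hedged sketch (not taken from the paper), the snippet below shows one way those environments could be instantiated. The environment IDs (`CartPole-v0`, `InvertedPendulum-v2`) and the simglucose registration pattern (`simglucose.envs:T1DSimEnv` with patient `adolescent#002`) are assumptions based on standard Gym and simglucose usage, not details reported by the authors.

```python
# Hedged sketch: loading the three evaluation environments named in the paper.
# Environment IDs and the simglucose registration are assumptions based on
# standard OpenAI Gym / simglucose usage, not settings reported by the authors.
import gym
from gym.envs.registration import register

# Control tasks from OpenAI Gym (Brockman et al., 2016).
cartpole = gym.make("CartPole-v0")
pendulum = gym.make("InvertedPendulum-v2")  # requires a MuJoCo installation

# Type-1 Diabetes simulator from https://github.com/jxx123/simglucose.
register(
    id="simglucose-adolescent2-v0",            # hypothetical patient choice
    entry_point="simglucose.envs:T1DSimEnv",
    kwargs={"patient_name": "adolescent#002"},
)
diabetes = gym.make("simglucose-adolescent2-v0")

# Print the observation/action spaces to confirm the environments loaded.
for name, env in [("CartPole", cartpole),
                  ("InvertedPendulum", pendulum),
                  ("Diabetes", diabetes)]:
    print(name, env.observation_space, env.action_space)
```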
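
The Experiment Setup row reports γ = 0.95, per-environment horizons, and Gaussian RBF kernels with tuned bandwidths. The sketch below only illustrates those ingredients: a truncated discounted return and a Gaussian RBF kernel with a median-heuristic bandwidth. The function names and the median heuristic are illustrative assumptions; the paper instead selects bandwidths so that the Bellman loss stays small on a validation set (Appendix H.2).

```python
# Hedged sketch of two ingredients named in the Experiment Setup row:
# the truncated discounted return (gamma = 0.95, horizon H) and a Gaussian
# RBF kernel. The median-heuristic bandwidth is an illustrative default,
# not the paper's validation-set selection procedure.
import numpy as np

GAMMA = 0.95  # discount factor used for all three environments
HORIZONS = {"InvertedPendulum": 50, "CartPole": 100, "Diabetes": 50}

def truncated_discounted_return(rewards, gamma=GAMMA, horizon=50):
    """Sum_{t < H} gamma^t * r_t for a single trajectory."""
    rewards = np.asarray(rewards, dtype=float)[:horizon]
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

def gaussian_rbf_kernel(X, Y, bandwidth):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def median_heuristic_bandwidth(X):
    """Common default bandwidth; the paper tunes bandwidths on a validation set."""
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    return float(np.median(dists[dists > 0]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_rewards = rng.random(200)  # placeholder rewards, not from the paper
    print(truncated_discounted_return(fake_rewards, horizon=HORIZONS["CartPole"]))
    states = rng.normal(size=(10, 4))
    print(gaussian_rbf_kernel(states, states, median_heuristic_bandwidth(states)).shape)
```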