Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
Authors: Yaqi Duan, Zeyu Jia, Mengdi Wang
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper studies the statistical theory of off-policy evaluation with function approximation in batch-data reinforcement learning. We consider a regression-based fitted Q-iteration method and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample, instance-dependent error upper bound and a nearly matching minimax lower bound. |
| Researcher Affiliation | Collaboration | Yaqi Duan (1), Zeyu Jia (2), Mengdi Wang (3, 4). (1) Department of Operations Research and Financial Engineering, Princeton University, NJ, USA; (2) School of Mathematics, Peking University, Beijing, China; (3) Department of Operations Research and Financial Engineering, Princeton University, NJ, USA; (4) DeepMind, London, UK. |
| Pseudocode | Yes | Algorithm 1 Fitted Q-iteration for Off-Policy Evaluation (FQI-OPE) and Algorithm 2 Conditional Mean Embedding for Policy Evaluation (CME-PE) are provided. |
| Open Source Code | No | The paper does not provide any specific statements or links regarding the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper focuses on theoretical analysis and does not mention or provide access information for any specific publicly available datasets used for training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments; therefore, it does not provide dataset split information for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not describe empirical experiments; therefore, it does not provide hardware details for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not describe empirical experiments; therefore, it does not list ancillary software or version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on mathematical proofs and algorithms, without describing any specific experimental setup details such as hyperparameters or training configurations. |
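The Research Type and Pseudocode rows state that regression-based fitted Q-iteration (FQI-OPE) is equivalent to a model-based estimate built from a conditional mean embedding of the transition operator (CME-PE). The sketch below illustrates that algebraic equivalence with linear features; it is not the authors' Algorithm 1 or 2. It assumes a finite horizon `H`, a finite action set, a known target policy, ridge regularization `lam`, and a toy two-state MDP; the names `fqi_ope`, `cme_pe`, `one_hot`, `target_pi`, and `make_dataset` are hypothetical. Pure Python is used so the example has no dependencies.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def solve(A, b):
    """Solve A x = b (A is d x d, nonsingular) by Gaussian elimination with partial pivoting."""
    d = len(A)
    M = [list(A[i]) + [b[i]] for i in range(d)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, d):
            f = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * d
    for r in range(d - 1, -1, -1):
        x[r] = (M[r][d] - sum(M[r][c] * x[c] for c in range(r + 1, d))) / M[r][r]
    return x

def phi_pi(s, phi, pi, actions):
    """Policy-averaged feature sum_a pi(a|s) phi(s, a); pi(s, a) returns pi(a | s)."""
    feats = [phi(s, a) for a in actions]
    return [sum(pi(s, a) * f[i] for a, f in zip(actions, feats)) for i in range(len(feats[0]))]

def gram(data, phi, d, lam):
    """Regularized empirical covariance lam * I + sum_n phi_n phi_n^T."""
    S = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    for s, a, _, _ in data:
        f = phi(s, a)
        for i in range(d):
            for j in range(d):
                S[i][j] += f[i] * f[j]
    return S

def fqi_ope(data, phi, pi, actions, d, H, init_states, lam=1e-6):
    """Fitted Q-iteration: for h = H..1, ridge-regress r + V_{h+1}(s') onto phi(s, a)."""
    S = gram(data, phi, d, lam)
    w = [0.0] * d  # weights of Q_{H+1} = 0
    for _ in range(H):
        rhs = [0.0] * d
        for s, a, r, s2 in data:
            y = r + dot(phi_pi(s2, phi, pi, actions), w)  # regression target
            f = phi(s, a)
            for i in range(d):
                rhs[i] += f[i] * y
        w = solve(S, rhs)
    return sum(dot(phi_pi(s0, phi, pi, actions), w) for s0 in init_states) / len(init_states)

def cme_pe(data, phi, pi, actions, d, H, init_states, lam=1e-6):
    """Model-based form: estimate reward weights and a d x d transition operator once, then iterate."""
    S = gram(data, phi, d, lam)
    br = [0.0] * d
    B = [[0.0] * d for _ in range(d)]  # B[i][j] = sum_n phi_n[i] * phi_pi(s'_n)[j]
    for s, a, r, s2 in data:
        f, g = phi(s, a), phi_pi(s2, phi, pi, actions)
        for i in range(d):
            br[i] += f[i] * r
            for j in range(d):
                B[i][j] += f[i] * g[j]
    theta_r = solve(S, br)
    cols = [solve(S, [B[i][j] for i in range(d)]) for j in range(d)]  # column j of S^{-1} B
    w = [0.0] * d
    for _ in range(H):  # w_h = theta_r + (S^{-1} B) w_{h+1}
        w = [theta_r[i] + sum(cols[j][i] * w[j] for j in range(d)) for i in range(d)]
    return sum(dot(phi_pi(s0, phi, pi, actions), w) for s0 in init_states) / len(init_states)

ACTIONS, D = [0, 1], 4

def one_hot(s, a):
    """Hypothetical feature map: one-hot over (s, a) in {0,1}^2 (the tabular special case)."""
    f = [0.0] * D
    f[2 * s + a] = 1.0
    return f

def target_pi(s, a):
    """Hypothetical target policy: always take action 1."""
    return 1.0 if a == 1 else 0.0

def make_dataset(n, seed=0):
    """Hypothetical behavior data: s, a uniform; reward = s; next state = a w.p. 0.9, else 1 - a."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s, a = rng.randint(0, 1), rng.randint(0, 1)
        s2 = a if rng.random() < 0.9 else 1 - a
        data.append((s, a, float(s), s2))
    return data

if __name__ == "__main__":
    args = (one_hot, target_pi, ACTIONS, D, 3, [0])
    data = make_dataset(2000)
    print(fqi_ope(data, *args), cme_pe(data, *args))
```

Each FQI step computes `solve(S, br + B w)`, which equals `theta_r + S^{-1} B w`, so the two functions agree up to floating-point error for any feature map. In this toy MDP, starting from state 0 with horizon 3, the true value of the target policy is 0 + 0.9 + 0.9 = 1.8.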