Provably Efficient Causal Reinforcement Learning with Confounded Observational Data
Authors: Lingxiao Wang, Zhuoran Yang, Zhaoran Wang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | To tackle such challenges, we propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner. DOVI explicitly adjusts for the confounding bias in the observational data, where the confounders are partially observed or unobserved. In both cases, such adjustments allow us to construct the bonus based on a notion of information gain, which takes into account the amount of information acquired from the offline setting. In particular, we prove that the regret of DOVI is smaller than the optimal regret achievable in the pure online setting when the confounded observational data are informative upon the adjustments. |
| Researcher Affiliation | Academia | Lingxiao Wang, Northwestern University (lwang@u.northwestern.edu); Zhuoran Yang, Princeton University (zy6@princeton.edu); Zhaoran Wang, Northwestern University (zhaoranwang@gmail.com) |
| Pseudocode | Yes | Algorithm 1 Deconfounded Optimistic Value Iteration (DOVI) for Confounded MDP |
| Open Source Code | No | The paper does not provide any information about open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not specify or provide access information for a publicly available or open dataset for training purposes. |
| Dataset Splits | No | The paper is theoretical and does not specify training/validation/test dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not provide hardware specifications used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not provide specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup or specific hyperparameters. |
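
The Research Type entry describes a bonus that shrinks as more information is acquired from the offline data. The paper ships no code, so the following is only an illustrative sketch of that idea and not the DOVI algorithm itself. It assumes linear features and a standard elliptical (UCB-style) bonus. The names `gram_matrix` and `elliptical_bonus`, the feature dimension, and the random features standing in for adjusted observational data are all hypothetical, and the confounding adjustment that DOVI performs is not reproduced here.

```python
import numpy as np

def gram_matrix(features, reg=1.0):
    """Regularized Gram matrix: reg * I + sum_i phi_i phi_i^T."""
    features = np.atleast_2d(features)
    d = features.shape[1]
    return reg * np.eye(d) + features.T @ features

def elliptical_bonus(phi, gram, beta=1.0):
    """UCB-style bonus: beta * sqrt(phi^T gram^{-1} phi)."""
    return beta * float(np.sqrt(phi @ np.linalg.solve(gram, phi)))

rng = np.random.default_rng(0)
d = 4  # hypothetical feature dimension

# Assumption: random vectors stand in for features of online and
# (already adjusted) offline observational transitions.
online_feats = rng.normal(size=(20, d))
offline_feats = rng.normal(size=(200, d))

gram_online = gram_matrix(online_feats)
# Pooling adds the offline outer products to the online Gram matrix.
gram_pooled = gram_online + offline_feats.T @ offline_feats

query = rng.normal(size=d)
bonus_online = elliptical_bonus(query, gram_online)
bonus_pooled = elliptical_bonus(query, gram_pooled)

print(f"online-only bonus: {bonus_online:.4f}")
print(f"pooled bonus:      {bonus_pooled:.4f}")
```

Because the pooled Gram matrix dominates the online-only one, the pooled bonus can never be larger. This is the mechanism by which informative offline data would reduce exploration in this simplified setting.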