Provably Efficient Causal Reinforcement Learning with Confounded Observational Data

Authors: Lingxiao Wang, Zhuoran Yang, Zhaoran Wang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | To tackle such challenges, we propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner. DOVI explicitly adjusts for the confounding bias in the observational data, where the confounders are partially observed or unobserved. In both cases, such adjustments allow us to construct the bonus based on a notion of information gain, which takes into account the amount of information acquired from the offline setting. In particular, we prove that the regret of DOVI is smaller than the optimal regret achievable in the pure online setting when the confounded observational data are informative upon the adjustments. (An illustrative sketch of such an adjusted optimistic value iteration follows this table.)
Researcher Affiliation | Academia | Lingxiao Wang (Northwestern University, lwang@u.northwestern.edu); Zhuoran Yang (Princeton University, zy6@princeton.edu); Zhaoran Wang (Northwestern University, zhaoranwang@gmail.com)
Pseudocode | Yes | Algorithm 1: Deconfounded Optimistic Value Iteration (DOVI) for Confounded MDP
Open Source Code | No | The paper does not provide any information about open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not reference or provide access information for any publicly available dataset.
Dataset Splits | No | The paper is theoretical and does not specify training/validation/test dataset splits.
Hardware Specification | No | The paper is theoretical and does not report hardware specifications, as it contains no experiments.
Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup or specific hyperparameters.
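The Research Type row above quotes the abstract's high-level description of DOVI: adjust for the confounding in the observational data, then run optimistic value iteration with a bonus that reflects how much information the adjusted offline data already provide. The sketch below is a hypothetical, heavily simplified illustration of those two ingredients for a tabular MDP with a fully observed confounder. It is not the paper's Algorithm 1 (which builds its bonus from a notion of information gain and also covers unobserved confounders); the function names, array shapes, and constants such as backdoor_transition, beta, and the assumed-known confounder marginal p_w are all illustrative assumptions.

```python
# Hypothetical sketch only: NOT the paper's DOVI (Algorithm 1). It illustrates,
# for a tabular MDP with a fully observed confounder, (i) a backdoor adjustment
# that converts confounded observational counts into interventional transition
# estimates and (ii) optimistic value iteration whose bonus shrinks as offline
# plus online counts grow. All names, sizes, and constants are assumptions.

import numpy as np

S, A, W, H = 5, 3, 2, 10           # states, actions, confounder values, horizon
rng = np.random.default_rng(0)

# Offline, confounded counts N_obs[w, s, a, s'] with the confounder w recorded.
N_obs = rng.integers(0, 20, size=(W, S, A, S)).astype(float)
p_w = np.full(W, 1.0 / W)          # assumed known marginal of the confounder


def backdoor_transition(N_obs, p_w):
    """Backdoor-adjusted estimate of P(s' | s, do(a)) from confounded counts."""
    P = np.zeros((S, A, S))
    for w in range(W):
        counts = N_obs[w]                              # shape (S, A, S')
        totals = counts.sum(axis=-1, keepdims=True)
        cond = np.divide(counts, totals,
                         out=np.full_like(counts, 1.0 / S),
                         where=totals > 0)             # P_hat(s' | s, a, w)
        P += p_w[w] * cond                             # sum_w P(s'|s,a,w) P(w)
    return P


def optimistic_value_iteration(P_hat, R, N_eff, N_online, beta=1.0):
    """Finite-horizon value iteration with a count-based optimism bonus."""
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        bonus = beta / np.sqrt(np.maximum(N_eff + N_online, 1.0))
        Q[h] = np.minimum(R + P_hat @ V[h + 1] + bonus, H)   # optimism, clipped
        V[h] = Q[h].max(axis=-1)
    return Q, V


P_hat = backdoor_transition(N_obs, p_w)
N_eff = N_obs.sum(axis=(0, -1))     # effective offline sample count per (s, a)
R = rng.uniform(size=(S, A))        # reward assumed known, for simplicity

Q, V = optimistic_value_iteration(P_hat, R, N_eff, N_online=np.zeros((S, A)))
print("optimistic value of the initial states:", np.round(V[0], 2))
```

In this toy version the count-based bonus merely stands in for the paper's information-gain quantity: the larger the effective offline count N_eff after adjustment, the smaller the optimism bonus, which mirrors the abstract's claim that informative observational data reduce the regret relative to the pure online setting.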