Online Reinforcement Learning for Mixed Policy Scopes

Authors: Junzhe Zhang, Elias Bareinboim

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 (Simulations): In this section, we evaluate the performance of our algorithms on randomly generated SCMs in various types of causal diagrams. Overall, our algorithms can consistently find the corresponding optimal policies with mixed scopes. Further, leveraging causal relationships in the underlying environment accelerates the convergence rate of online learners. In all experiments, we evaluate the novel CAUSAL-TS*, with uninformative Dirichlet priors over exogenous probabilities and uniform priors over structural functions, which we label as c-ts*. As a baseline, we also include randomized trials (rct) allocating treatments in all possible scopes uniformly at random, the standard Thompson sampling algorithm (ts) using all deterministic policies as arms, and Thompson sampling over a simplified mixed scope (ts*), which is obtained by applying graphical conditions in [18]. For all algorithms, we measure their cumulative regrets over T = 1.1 × 10³ episodes. We refer readers to the technical report [42, Appendix B] for a more detailed discussion on the experimental set-up. Figure 3 caption: Simulations comparing online learners that are randomized (rct), adaptive (ts), adaptive with simplified policy scopes (ts*), and causally enhanced (c-ts*); the x-axis represents total episodes and the y-axis the cumulative regret. (An illustrative regret-comparison sketch follows this table.)
Researcher Affiliation | Academia | Junzhe Zhang, Causal AI Lab, Columbia University (junzhez@cs.columbia.edu); Elias Bareinboim, Causal AI Lab, Columbia University (eb@cs.columbia.edu)
Pseudocode | Yes | Algorithm 1, CAUSAL-UCB*. Input: causal diagram G, policy space ΠS, failure tolerance δ ∈ (0, 1). Algorithm 2, MINCOLLECT. Input: causal diagram G, c-collection C. Algorithm 3, CAUSAL-TS*. Input: causal diagram G, policy space ΠS, prior ρ. (A hypothetical interface sketch mirroring these inputs follows this table.)
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is open-source or publicly available.
Open Datasets | No | The paper states, "We randomly generate 100 instances of SCMs in Fig. 1a with binary X1, X2, Z, W, Y ∈ {0, 1}," and refers to "randomly generated SCMs" throughout the simulations section. This indicates synthetic data, not a publicly accessible dataset with concrete access information or citation.
Dataset Splits | No | The paper focuses on online reinforcement learning and measures cumulative regret over T episodes using randomly generated SCMs. It does not mention traditional train/validation/test splits, which are typically found in static supervised learning contexts.
Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments (e.g., GPU/CPU models, memory specifications, or cloud instances).
Software Dependencies | No | The paper does not specify any software names with version numbers, such as programming languages, libraries, or frameworks used in the implementation or experiments.
Experiment Setup | Yes | In all experiments, we evaluate the novel CAUSAL-TS*, with uninformative Dirichlet priors over exogenous probabilities and uniform priors over structural functions, which we label as c-ts*. As a baseline, we also include randomized trials (rct) allocating treatments in all possible scopes uniformly at random, the standard Thompson sampling algorithm (ts) using all deterministic policies as arms, and Thompson sampling over a simplified mixed scope (ts*), which is obtained by applying graphical conditions in [18]. For all algorithms, we measure their cumulative regrets over T = 1.1 × 10³ episodes. We randomly generate 100 instances of SCMs in Fig. 1a with binary X1, X2, Z, W, Y ∈ {0, 1}. Exogenous variables U1, U2 are discrete, taking values in finite domains with cardinalities d1 = 16 and d2 = 48, respectively.
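
As a companion to the Research Type row, here is a minimal, self-contained sketch of the kind of comparison the quoted simulations describe: a randomized-allocation baseline (rct-style) versus a Thompson sampling learner (ts-style), tracked by cumulative regret over T = 1.1 × 10³ episodes. This is an illustrative Python stand-in with made-up arm rewards, not the paper's SCM environments or its CAUSAL-TS* algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arm rewards: each "arm" stands in for one deterministic policy.
# These values are made up for illustration and are not from the paper.
true_means = np.array([0.30, 0.50, 0.45, 0.70])
K = len(true_means)
T = 1100  # T = 1.1 x 10^3 episodes, matching the reported horizon

def pull(arm):
    """Sample a binary reward Y for the chosen arm/policy."""
    return float(rng.random() < true_means[arm])

def run(select):
    """Run T episodes with a given selection rule; return cumulative regret."""
    alpha, beta = np.ones(K), np.ones(K)  # Beta(1, 1) posterior per arm
    best = true_means.max()
    regret = []
    for _ in range(T):
        arm = select(alpha, beta)
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1.0 - r
        regret.append(best - true_means[arm])
    return np.cumsum(regret)

# rct-style baseline: allocate treatments uniformly at random.
rct_regret = run(lambda a, b: int(rng.integers(K)))
# ts-style learner: Thompson sampling over the same arms.
ts_regret = run(lambda a, b: int(np.argmax(rng.beta(a, b))))

print(f"final cumulative regret  rct: {rct_regret[-1]:.1f}  ts: {ts_regret[-1]:.1f}")
```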
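
The Pseudocode row lists only the inputs of Algorithms 1-3. Purely as a reading aid, the following hypothetical Python signatures mirror those inputs; the type and function names are assumptions, not the authors' code.

```python
from typing import Any, Callable, Dict, List

CausalDiagram = Any  # placeholder: the paper's causal diagram G
Policy = Callable    # placeholder: a policy over a mixed scope

def causal_ucb_star(G: CausalDiagram, policy_space: List[Policy], delta: float) -> Policy:
    """Algorithm 1 (CAUSAL-UCB*). Inputs: causal diagram G, policy space Pi_S,
    failure tolerance delta in (0, 1). Body intentionally omitted."""
    assert 0.0 < delta < 1.0
    raise NotImplementedError

def min_collect(G: CausalDiagram, c_collection: set) -> set:
    """Algorithm 2 (MINCOLLECT). Inputs: causal diagram G, c-collection C."""
    raise NotImplementedError

def causal_ts_star(G: CausalDiagram, policy_space: List[Policy], prior: Dict) -> Policy:
    """Algorithm 3 (CAUSAL-TS*). Inputs: causal diagram G, policy space Pi_S, prior rho."""
    raise NotImplementedError
```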
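
For the Experiment Setup row, a minimal sketch of how one random SCM instance with the reported cardinalities could be drawn: flat Dirichlet distributions over the exogenous U1, U2 (with |U1| = 16, |U2| = 48) and uniformly sampled binary structural functions. The parent sets below are placeholders chosen for illustration, since Fig. 1a's exact diagram is not reproduced in this table.

```python
import numpy as np

rng = np.random.default_rng(1)

# Cardinalities from the reported set-up: binary endogenous variables,
# exogenous U1, U2 with |U1| = 16, |U2| = 48.
d1, d2 = 16, 48

def random_instance():
    """Draw one random SCM instance (illustrative; parent sets are placeholders,
    not necessarily those of Fig. 1a in the paper)."""
    # Uninformative (flat) Dirichlet priors over the exogenous distributions.
    p_u1 = rng.dirichlet(np.ones(d1))
    p_u2 = rng.dirichlet(np.ones(d2))
    # Structural functions drawn uniformly at random: each one is a random
    # binary lookup table over the joint domain of its (assumed) parents.
    f_z = rng.integers(2, size=d1)             # Z  <- f_Z(U1)
    f_x1 = rng.integers(2, size=(2, d2))       # X1 <- f_X1(Z, U2)
    f_w = rng.integers(2, size=(2, d1))        # W  <- f_W(X1, U1)
    f_x2 = rng.integers(2, size=(2, d2))       # X2 <- f_X2(W, U2)
    f_y = rng.integers(2, size=(2, 2, 2, d1))  # Y  <- f_Y(X1, X2, W, U1)
    return p_u1, p_u2, (f_z, f_x1, f_w, f_x2, f_y)

def sample_episode(inst, do_x1=None, do_x2=None):
    """Sample one episode, optionally intervening on X1 and/or X2."""
    p_u1, p_u2, (f_z, f_x1, f_w, f_x2, f_y) = inst
    u1, u2 = rng.choice(d1, p=p_u1), rng.choice(d2, p=p_u2)
    z = f_z[u1]
    x1 = do_x1 if do_x1 is not None else f_x1[z, u2]
    w = f_w[x1, u1]
    x2 = do_x2 if do_x2 is not None else f_x2[w, u2]
    y = f_y[x1, x2, w, u1]
    return dict(Z=int(z), X1=int(x1), W=int(w), X2=int(x2), Y=int(y))

print(sample_episode(random_instance(), do_x1=1))
```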