Bandits with Unobserved Confounders: A Causal Approach
Authors: Elias Bareinboim, Andrew Forney, Judea Pearl
NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empower Thompson Sampling with this new principle and run extensive simulations. The experiments suggest that the new strategy is statistically efficient and consistent (Sec. 4). In the next section, we provide simulations to support the efficacy of TSC in the MABUC context. Experiment 1 employs the Greedy Casino parameterization... Each experiment compares the performance of traditional Thompson Sampling bandit players versus TSC. All reported simulations are partitioned into rounds of T = 1000 trials averaged over N = 1000 Monte Carlo repetitions. |
| Researcher Affiliation | Academia | Elias Bareinboim, Department of Computer Science, Purdue University, eb@purdue.edu; Andrew Forney, Department of Computer Science, University of California, Los Angeles, forns@cs.ucla.edu; Judea Pearl, Department of Computer Science, University of California, Los Angeles, judea@cs.ucla.edu |
| Pseudocode | Yes | Algorithm 1 Causal Thompson Sampling (TSC). 1: procedure TSC(Pobs, T) 2: E(YX=a | X) ← Pobs(y | X) (seed distribution) 3: for t = 1, ..., T do 4: x ← intuition(t) (get intuition for trial) 5: Q1 ← E(YX=x' | X = x) (estimated payout for counter-intuition) 6: Q2 ← P(y | X = x) (estimated payout for intuition) 7: w ← [1, 1] (initialize weights) 8: bias ← 1 − |Q1 − Q2| (compute weighting strength) 9: if Q1 > Q2 then w[x] ← bias else w[x'] ← bias (choose arm to bias) 10: a ← max(β(sM1,x, fM1,x) · w[1], β(sM2,x, fM2,x) · w[2]) (choose arm) 11: y ← pull(a) (receive reward) 12: E(YX=a | X = x) ← y | a, x (update) (see the Python sketch after this table) |
| Open Source Code | Yes | (For simulation source code, see https://github.com/ucla-csl/mabuc ) |
| Open Datasets | No | The paper describes two synthetic scenarios with payout rates provided in tables (Table 1 and Table 2), which define the data generation process for the simulations. It does not refer to a pre-existing, publicly available dataset with a specific link, DOI, or repository for direct download. |
| Dataset Splits | No | The paper conducts simulations in a sequential decision-making setting, where algorithms learn online. It describes the number of trials and Monte Carlo repetitions, but it does not specify traditional train/validation/test dataset splits as one would for a static dataset. |
| Hardware Specification | No | The paper describes simulations but does not provide specific hardware details (such as CPU/GPU models, memory, or specific computing environments) used to run these experiments. |
| Software Dependencies | No | The paper mentions algorithms like Thompson Sampling and implies the use of Python via the GitHub link, but it does not specify any software dependencies with version numbers (e.g., library names with specific versions). |
| Experiment Setup | Yes | All reported simulations are partitioned into rounds of T = 1000 trials averaged over N = 1000 Monte Carlo repetitions. At each time step in a single round, (1) values for the unobserved confounders (B, D) and the observed intuitive arm choice (x) are selected by their respective structural equations (see Section 2), (2) the player observes the value of x, (3) the player chooses an arm based on their given strategy to maximize reward (which may or may not employ x), and finally, (4) the player receives a Bernoulli reward Y ∈ {0, 1} and records the outcome. Furthermore, at the start of every round, players possess knowledge of the problem's observational distribution, i.e., each player begins knowing P(Y | X) (see Table 2b). The notation β(sMk,x, fMk,x) means to sample from a Beta distribution with parameters equal to the successes encountered choosing action x on machine Mk (sMk,x) and the failures encountered choosing action x on that machine (fMk,x). (A simulation-harness sketch follows this table.) |
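
To make the pseudocode row (Algorithm 1) concrete, here is a minimal Python/NumPy sketch of a TSC player for a two-arm, two-context bandit. The payout table, the uniform Beta(1, 1) priors, and the way the intuition signal is drawn are illustrative assumptions, not the paper's Greedy Casino parameterization or the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical interventional payouts P(y | do(X=a), intuition x); rows index the
# intuited arm x, columns the pulled arm a. Illustrative values, not the paper's Table 1.
PAYOUT = np.array([[0.10, 0.50],
                   [0.50, 0.20]])

# Observational payouts P(y | X=x): the reward rate when players follow their intuition.
P_OBS = PAYOUT.diagonal().copy()


def ts_causal(T=1000):
    """One round of a Causal Thompson Sampling (TSC) player, following Algorithm 1."""
    s = np.ones((2, 2))  # successes[x, a], uniform Beta(1, 1) prior per (intuition, arm)
    f = np.ones((2, 2))  # failures[x, a]
    rewards = np.zeros(T)
    for t in range(T):
        x = int(rng.integers(2))        # intuition for this trial
        x_prime = 1 - x                 # the counter-intuitive arm
        # Q1: current estimate of the counterfactual E(Y_{X=x'} | X = x)
        q1 = s[x, x_prime] / (s[x, x_prime] + f[x, x_prime])
        # Q2: observational payout P(y | X = x), known from the start of the round
        q2 = P_OBS[x]
        w = np.ones(2)
        bias = 1.0 - abs(q1 - q2)       # weighting strength
        if q1 > q2:
            w[x] = bias                 # dampen the intuitive arm
        else:
            w[x_prime] = bias           # dampen the counter-intuitive arm
        a = int(np.argmax(rng.beta(s[x], f[x]) * w))   # biased Thompson draw
        y = rng.random() < PAYOUT[x, a]                # Bernoulli reward
        s[x, a] += y
        f[x, a] += 1 - y
        rewards[t] = y
    return rewards
```

The weighting step (lines 7-9 of Algorithm 1) is the only departure from vanilla Thompson Sampling: whichever arm the comparison of the counterfactual estimate Q1 against the observational estimate Q2 disfavors has its posterior draw scaled down by bias = 1 − |Q1 − Q2|.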
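The experiment-setup row describes the simulation protocol: rounds of T = 1000 trials averaged over N = 1000 Monte Carlo repetitions, comparing traditional Thompson Sampling players against TSC. A hedged sketch of that harness, reusing the hypothetical PAYOUT table and ts_causal function from the block above, might look like this:

```python
def ts_standard(T=1000):
    """A conventional Thompson Sampling player that ignores the intuition signal."""
    s = np.ones(2)   # successes per arm
    f = np.ones(2)   # failures per arm
    rewards = np.zeros(T)
    for t in range(T):
        x = int(rng.integers(2))                 # the confounded intuition still drives payouts
        a = int(np.argmax(rng.beta(s, f)))       # Thompson sample over arms only
        y = rng.random() < PAYOUT[x, a]
        s[a] += y
        f[a] += 1 - y
        rewards[t] = y
    return rewards


def run_monte_carlo(N=1000, T=1000):
    """Average per-trial reward over N independent rounds, mirroring the reported T/N protocol."""
    avg_ts, avg_tsc = np.zeros(T), np.zeros(T)
    for _ in range(N):
        avg_ts += ts_standard(T)
        avg_tsc += ts_causal(T)
    return avg_ts / N, avg_tsc / N
```

Averaging the per-trial reward (or, equivalently, tracking cumulative regret) across repetitions is what produces the learning curves the paper reports; because the parameterization here is a placeholder, the resulting curves illustrate the procedure rather than reproduce the paper's figures.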