Counterfactual Data-Fusion for Online Reinforcement Learners

Authors: Andrew Forney, Judea Pearl, Elias Bareinboim

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings' efficacy with extensive simulations. We then develop a variant of the Thompson Sampling algorithm that implements this new heuristic, and run extensive simulations demonstrating its faster convergence rates compared to the current state-of-the-art (Sec. 5).
Researcher Affiliation | Academia | University of California, Los Angeles, California, USA; Purdue University, West Lafayette, Indiana, USA.
Pseudocode | No | The paper describes the procedure for TSRDC agents in a numbered list within the text, but it is not presented as a formal pseudocode block or algorithm.
Open Source Code | Yes | Supplemental material: for paper appendices and other resources, visit https://goo.gl/MYJWbY
Open Datasets | No | The paper describes a simulated online learning environment (the MABUC problem) rather than using a fixed, publicly available dataset with concrete access information for training. Simulations were performed on the 4-arm MABUC problem.
Dataset Splits | No | The paper describes an online learning and simulation setup and does not specify explicit validation dataset splits. Performance is evaluated over rounds of simulation, not on a static validation set.
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU or GPU models, memory, or cloud instances) used for running the simulations.
Software Dependencies | No | The paper mentions algorithms such as Thompson Sampling (TS) and RDC but does not specify any software libraries or frameworks with version numbers that would be needed for replication.
Experiment Setup | Yes | Simulations were performed on the 4-arm MABUC problem, with results averaged across N = 1000 Monte Carlo repetitions, each T = 3000 rounds in duration. To illustrate the robustness of each proposed strategy, simulations spanned a wide range of payout parameterizations (see Appendix B for a complete report of experimental results). In brief, TSRDC agents perform the following at each round: (1) observe the intent i_t from the current round's realization of the UCs, u_t; (2) sample Ê_samp[Y_{x_r} | i_t] from each arm x_r's corresponding intent-specific beta distribution β(s_{x_r, i_t}, f_{x_r, i_t}), in which s_{x_r, i_t} is the number of successes (wins) and f_{x_r, i_t} is the number of failures (losses); (3) compute each arm's i_t-specific score from the combined datasets via Strategy 3 (Eq. 9); (4) choose the arm x_a with the highest score computed in the previous step; (5) observe the result (win/loss) and update Ê_samp[Y_{x_a} | i_t].
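
The five-step procedure in the Experiment Setup row is, at its core, an intent-specific Thompson Sampling loop. The Python sketch below illustrates that loop under stated assumptions: the payout table and the way intents are generated are hypothetical stand-ins for the MABUC environment with unobserved confounders, and step (3) scores arms using the sampled counterfactual estimates alone rather than the paper's combined-data Strategy 3 (Eq. 9), which additionally fuses observational and experimental data.

```python
import numpy as np

# Minimal sketch of an intent-specific Thompson Sampling loop for a
# 4-arm MABUC-style problem. The environment model and the scoring in
# step (3) are illustrative placeholders, not the authors' implementation.

rng = np.random.default_rng(0)
K = 4        # number of arms (and, here, number of intent values)
T = 3000     # rounds per simulation, matching the setup reported above

# Hypothetical payout table: P(win | arm x_r, intent i_t). In the MABUC
# problem the intent is itself determined by unobserved confounders.
payout = rng.uniform(0.1, 0.9, size=(K, K))   # rows: arms, columns: intents

# Intent-specific Beta posteriors: successes s and failures f per (arm, intent).
s = np.ones((K, K))
f = np.ones((K, K))

cumulative_regret = 0.0
for t in range(T):
    # (1) Observe the intent i_t produced by this round's confounders u_t.
    i_t = int(rng.integers(K))

    # (2) Sample E_samp[Y_{x_r} | i_t] from each arm's Beta(s, f) posterior.
    samples = rng.beta(s[:, i_t], f[:, i_t])

    # (3) Placeholder score: the full algorithm would combine these samples
    #     with observational and experimental data (Strategy 3, Eq. 9).
    scores = samples

    # (4) Choose the arm x_a with the highest score.
    x_a = int(np.argmax(scores))

    # (5) Observe the result (win/loss) and update the intent-specific posterior.
    win = rng.random() < payout[x_a, i_t]
    s[x_a, i_t] += win
    f[x_a, i_t] += 1 - win

    cumulative_regret += payout[:, i_t].max() - payout[x_a, i_t]

print(f"Cumulative regret after {T} rounds: {cumulative_regret:.1f}")
```

A full reproduction would repeat this loop for the N = 1000 Monte Carlo repetitions described above, average the regret curves across runs, and replace the placeholder score with the intent-specific combination of observational, experimental, and counterfactual estimates from Eq. 9.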