Counterfactual Data-Fusion for Online Reinforcement Learners
Authors: Andrew Forney, Judea Pearl, Elias Bareinboim
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings' efficacy with extensive simulations. We then develop a variant of the Thompson Sampling algorithm that implements this new heuristic, and run extensive simulations demonstrating its faster convergence rates compared to the current state-of-the-art (Sec. 5). |
| Researcher Affiliation | Academia | University of California, Los Angeles, California, USA; Purdue University, West Lafayette, Indiana, USA. |
| Pseudocode | No | The paper describes the procedure for TSRDC agents in a numbered list within the text, but it is not presented as a formal pseudocode block or algorithm. |
| Open Source Code | Yes | Supplemental material: for paper appendices and other resources, visit https://goo.gl/MYJWbY |
| Open Datasets | No | The paper describes a simulated online learning environment (MABUC problem) rather than using a fixed, publicly available dataset with concrete access information for training. Simulations were performed on the 4-arm MABUC problem. |
| Dataset Splits | No | The paper describes an online learning and simulation setup, and does not specify explicit validation dataset splits. Performance is evaluated over rounds of simulation, not a static validation set. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, memory, or cloud instances) used for running the simulations. |
| Software Dependencies | No | The paper mentions algorithms like 'Thompson Sampling (TS)' and 'RDC' but does not specify any software libraries or frameworks with version numbers that would be needed for replication. |
| Experiment Setup | Yes | Simulations were performed on the 4-arm MABUC problem, with results averaged across N = 1000 Monte Carlo repetitions, each T = 3000 rounds in duration. To illustrate the robustness of each proposed strategy, we performed simulations spanning a wide range of payout parameterizations (see Appendix B for a complete report of experimental results). In brief, TSRDC agents perform the following at each round: (1) Observe the intent $i_t$ from the current round's realization of UCs, $u_t$. (2) Sample $\hat{E}_{samp}[Y_{x_r} \mid i_t]$ from each arm's ($x_r$) corresponding intent-specific beta distribution $\beta(s_{x_r,i_t}, f_{x_r,i_t})$, in which $s_{x_r,i_t}$ is the number of successes (wins) and $f_{x_r,i_t}$ is the number of failures (losses). (3) Compute each arm's $i_t$-specific score using the combined datasets via Strategy 3 (Eq. 9). (4) Choose the arm, $x_a$, with the highest score computed in the previous step. (5) Observe the result (win/loss) and update $\hat{E}_{samp}[Y_{x_a} \mid i_t]$. |
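
The five-step loop quoted above maps directly onto a small simulation. The sketch below is a minimal illustration, assuming a 4-arm MABUC instance in which the intent variable takes one value per arm and the payout table is a random placeholder (not the parameterizations from the paper's Appendix B); the data-fusion weighting of Strategy 3 (Eq. 9) is omitted, so the score reduces to intent-conditioned Thompson Sampling rather than the authors' full TSRDC agent.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4            # number of arms (4-arm MABUC problem)
T = 3000         # rounds per simulation, as reported in the paper
N_INTENTS = K    # assumption: one intent value per arm

# Intent-specific Beta counts: successes s[x, i] and failures f[x, i],
# initialized to Beta(1, 1) priors.
s = np.ones((K, N_INTENTS))
f = np.ones((K, N_INTENTS))

# Hypothetical ground-truth payouts P(Y=1 | do(X=x), I=i); placeholder values.
payout = rng.uniform(0.2, 0.8, size=(K, N_INTENTS))

for t in range(T):
    # (1) Observe the intent i_t induced by this round's unobserved confounders.
    i_t = rng.integers(N_INTENTS)

    # (2) Sample E_samp[Y_{x_r} | i_t] from each arm's intent-specific Beta.
    samples = rng.beta(s[:, i_t], f[:, i_t])

    # (3) The paper combines these samples with observational and experimental
    # data via Strategy 3 (Eq. 9); that fusion step is omitted in this sketch.
    scores = samples

    # (4) Choose the arm with the highest score.
    x_a = int(np.argmax(scores))

    # (5) Observe the win/loss outcome and update the chosen arm's counts.
    y = rng.random() < payout[x_a, i_t]
    s[x_a, i_t] += y
    f[x_a, i_t] += 1 - y
```

The key structural point, faithful to the quoted setup, is that each arm maintains a separate Beta posterior per observed intent, so sampling is conditioned on the counterfactual quantity $E[Y_{x} \mid i_t]$ rather than on an unconditional arm mean.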