Counterfactual Data-Fusion for Online Reinforcement Learners

Authors: Andrew Forney, Judea Pearl, Elias Bareinboim

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings' efficacy with extensive simulations. We then develop a variant of the Thompson Sampling algorithm that implements this new heuristic, and run extensive simulations demonstrating its faster convergence rates compared to the current state-of-the-art (Sec. 5).
Researcher Affiliation | Academia | University of California, Los Angeles, California, USA; Purdue University, West Lafayette, Indiana, USA.
Pseudocode | No | The paper describes the procedure for TSRDC agents in a numbered list within the text, but it is not presented as a formal pseudocode block or algorithm.
Open Source Code | Yes | Supplemental material: for paper appendices and other resources, visit https://goo.gl/MYJWbY
Open Datasets | No | The paper describes a simulated online learning environment (the MABUC problem) rather than using a fixed, publicly available dataset with concrete access information for training. Simulations were performed on the 4-arm MABUC problem.
Dataset Splits | No | The paper describes an online learning and simulation setup and does not specify explicit validation dataset splits. Performance is evaluated over rounds of simulation, not on a static validation set.
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU or GPU models, memory, or cloud instances) used for running the simulations.
Software Dependencies | No | The paper mentions algorithms such as Thompson Sampling (TS) and RDC but does not specify any software libraries or frameworks with version numbers that would be needed for replication.
Experiment Setup | Yes | Simulations were performed on the 4-arm MABUC problem, with results averaged across N = 1000 Monte Carlo repetitions, each T = 3000 rounds in duration. To illustrate the robustness of each proposed strategy, simulations spanned a wide range of payout parameterizations (see Appendix B for a complete report of experimental results). In brief, TSRDC agents perform the following at each round: (1) observe the intent i_t from the current round's realization of the UCs, u_t; (2) sample Ê_samp[Y_{x_r} | i_t] from each arm x_r's corresponding intent-specific beta distribution β(s_{x_r, i_t}, f_{x_r, i_t}), in which s_{x_r, i_t} is the number of successes (wins) and f_{x_r, i_t} is the number of failures (losses); (3) compute each arm's i_t-specific score from the combined datasets via Strategy 3 (Eq. 9); (4) choose the arm x_a with the highest score computed in the previous step; (5) observe the result (win/loss) and update Ê_samp[Y_{x_a} | i_t].
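
The five-step procedure in the Experiment Setup row is, at its core, an intent-specific Thompson Sampling loop. The Python sketch below illustrates that loop under stated assumptions: the payout table and the way intents are generated are hypothetical stand-ins for the MABUC environment with unobserved confounders, and step (3) scores arms using the sampled counterfactual estimates alone rather than the paper's combined-data Strategy 3 (Eq. 9), which additionally fuses observational and experimental data.

```python
import numpy as np

# Minimal sketch of an intent-specific Thompson Sampling loop for a
# 4-arm MABUC-style problem. The environment model and the scoring in
# step (3) are illustrative placeholders, not the authors' implementation.

rng = np.random.default_rng(0)
K = 4        # number of arms (and, here, number of intent values)
T = 3000     # rounds per simulation, matching the setup reported above

# Hypothetical payout table: P(win | arm x_r, intent i_t). In the MABUC
# problem the intent is itself determined by unobserved confounders.
payout = rng.uniform(0.1, 0.9, size=(K, K))   # rows: arms, columns: intents

# Intent-specific Beta posteriors: successes s and failures f per (arm, intent).
s = np.ones((K, K))
f = np.ones((K, K))

cumulative_regret = 0.0
for t in range(T):
    # (1) Observe the intent i_t produced by this round's confounders u_t.
    i_t = int(rng.integers(K))

    # (2) Sample E_samp[Y_{x_r} | i_t] from each arm's Beta(s, f) posterior.
    samples = rng.beta(s[:, i_t], f[:, i_t])

    # (3) Placeholder score: the full algorithm would combine these samples
    #     with observational and experimental data (Strategy 3, Eq. 9).
    scores = samples

    # (4) Choose the arm x_a with the highest score.
    x_a = int(np.argmax(scores))

    # (5) Observe the result (win/loss) and update the intent-specific posterior.
    win = rng.random() < payout[x_a, i_t]
    s[x_a, i_t] += win
    f[x_a, i_t] += 1 - win

    cumulative_regret += payout[:, i_t].max() - payout[x_a, i_t]

print(f"Cumulative regret after {T} rounds: {cumulative_regret:.1f}")
```

A full reproduction would repeat this loop for the N = 1000 Monte Carlo repetitions described above, average the regret curves across runs, and replace the placeholder score with the intent-specific combination of observational, experimental, and counterfactual estimates from Eq. 9.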