Coordinated Exploration in Concurrent Reinforcement Learning

Authors: Maria Dimakopoulou, Benjamin Van Roy

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present computational results that demonstrate the robustness of seed sampling algorithms of Section 3.2 versus the baseline algorithms of Section 3.1. In Sections 4.1 and 4.2, we present two simple problems that highlight the weaknesses of concurrent UCRL and Thompson resampling and demonstrate how severely performance may suffer due to violation of any among Properties 1, 2, 3. In Section 4.3, we demonstrate the relative efficiency of seed sampling in a more complex problem.
Researcher Affiliation | Academia | Maria Dimakopoulou and Benjamin Van Roy, Stanford University, California, USA. Correspondence to: Maria Dimakopoulou <madima@stanford.edu>, Benjamin Van Roy <bvr@stanford.edu>.
Pseudocode | No | The paper describes the algorithms and their mathematical formulations in prose, but does not provide structured pseudocode or algorithm blocks. (An illustrative sketch of the seed-sampling idea appears after this table.)
Open Source Code | No | The paper provides a link to a demo video (https://youtu.be/xjGK-wm0PkI) but does not provide concrete access to the source code for the methodology described.
Open Datasets | No | The paper defines problem specifications such as 'Bipolar Chain', 'Parallel Chains', and 'Maximum Reward Path' and describes how data for these scenarios is generated (e.g., 'we sample Erdős-Rényi graphs') rather than referencing established public datasets with access information. (A generator sketch appears after this table.)
Dataset Splits | No | The paper describes simulated environments and experiments but does not explicitly provide details about train, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware used for running the experiments.
Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers.
Experiment Setup | Yes | Consider the specification of the problem with C = 10 chains, horizon (or, equivalently, number of vertices in each chain) H = 5, θ_c ~ N(0, 100 + c) for c ∈ {1, ..., C}, and likelihood of observed reward when the last edge of chain c is traversed r_c ∣ θ_c ~ N(θ_c, 1). (A simulation sketch of this setup appears after this table.)
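
Since the paper presents its algorithms in prose only, the sketch below illustrates the seed-sampling idea of Section 3.2: each agent draws a random seed once and thereafter maps the shared observation history together with that seed to a sampled model deterministically, acting greedily with respect to that sample. The Gaussian-perturbation construction, the name `sample_model`, and the reduction to a single unknown mean are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def sample_model(seed: int, rewards: list[float],
                 prior_var: float = 100.0, noise_var: float = 1.0) -> float:
    """Deterministically map (shared data, agent seed) to a sampled mean reward.

    Illustrative construction: the seed fixes a prior draw and a stream of
    observation perturbations; combining them in a conjugate-Gaussian update
    yields a value distributed like a posterior sample, while remaining a
    deterministic function of the data given the seed.
    """
    rng = np.random.default_rng(seed)                  # agent-specific seed, drawn once
    theta_tilde = rng.normal(0.0, np.sqrt(prior_var))  # prior sample tied to the seed
    w = rng.normal(0.0, np.sqrt(noise_var), size=len(rewards))  # per-observation perturbations
    precision = 1.0 / prior_var + len(rewards) / noise_var
    return float((theta_tilde / prior_var
                  + np.sum(np.asarray(rewards) + w) / noise_var) / precision)
```

Because `np.random.default_rng(seed)` produces the same stream on every call, recomputing the sample as shared data grows reuses the same prior draw and perturbation prefix, so an agent's model adapts to new observations without being re-randomized; this is the kind of coherence the paper argues plain Thompson resampling lacks.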
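For context on the generated environments, sampling an Erdős-Rényi graph takes only a few lines; the vertex count and edge probability below are placeholders, since the paper's exact generation settings for the 'Maximum Reward Path' problem are not quoted here.

```python
import random

def erdos_renyi(n: int, p: float, seed: int | None = None) -> list[tuple[int, int]]:
    """Sample G(n, p): include each of the n*(n-1)/2 undirected edges
    independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

edges = erdos_renyi(20, 0.1, seed=0)  # example: 20 vertices, edge probability 0.1
```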
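The quoted 'Parallel Chains' setup is concrete enough to reproduce directly. A minimal sketch follows, assuming the second argument of N(·, ·) denotes a variance (the quote leaves the parameterization implicit) and that an agent observes a single reward when it finishes a chain:

```python
import numpy as np

C, H = 10, 5                     # chains and horizon from the quoted setup
rng = np.random.default_rng(0)

# Unknown terminal-edge reward of each chain: theta_c ~ N(0, 100 + c).
theta = rng.normal(0.0, np.sqrt(100.0 + np.arange(1, C + 1)))

def observe_reward(c: int) -> float:
    """Noisy reward when the last edge of chain c (1-indexed) is traversed:
    r_c | theta_c ~ N(theta_c, 1)."""
    return float(rng.normal(theta[c - 1], 1.0))
```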