Coordinated Exploration in Concurrent Reinforcement Learning

Authors: Maria Dimakopoulou, Benjamin Van Roy

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present computational results that demonstrate the robustness of seed sampling algorithms of Section 3.2 versus the baseline algorithms of Section 3.1. In Sections 4.1 and 4.2, we present two simple problems that highlight the weaknesses of concurrent UCRL and Thompson resampling and demonstrate how severely performance may suffer due to violation of any among Properties 1, 2, 3. In Section 4.3, we demonstrate the relative efficiency of seed sampling in a more complex problem.
Researcher Affiliation | Academia | Maria Dimakopoulou and Benjamin Van Roy, Stanford University, California, USA. Correspondence to: Maria Dimakopoulou <madima@stanford.edu>, Benjamin Van Roy <bvr@stanford.edu>.
Pseudocode | No | The paper describes the algorithms and their mathematical formulations in prose, but does not provide structured pseudocode or algorithm blocks. (An illustrative sketch of the seed-sampling idea appears after this table.)
Open Source Code | No | The paper provides a link to a demo video (https://youtu.be/xjGK-wm0PkI) but does not provide concrete access to the source code for the methodology described.
Open Datasets | No | The paper defines problem specifications such as 'Bipolar Chain', 'Parallel Chains', and 'Maximum Reward Path' and describes how data for these scenarios is generated (e.g., 'we sample Erdős-Rényi graphs') rather than referencing established public datasets with access information. (A generator sketch appears after this table.)
Dataset Splits | No | The paper describes simulated environments and experiments but does not explicitly provide details about train, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware used for running the experiments.
Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers.
Experiment Setup | Yes | Consider the specification of the problem with C = 10 chains, horizon (or, equivalently, number of vertices in each chain) H = 5, θ_c ~ N(0, 100 + c) for c ∈ {1, ..., C}, and likelihood of observed reward when the last edge of chain c is traversed r_c ∣ θ_c ~ N(θ_c, 1). (A simulation sketch of this setup appears after this table.)
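
Since the paper presents its algorithms in prose only, the sketch below illustrates the seed-sampling idea of Section 3.2: each agent draws a random seed once and thereafter maps the shared observation history together with that seed to a sampled model deterministically, acting greedily with respect to that sample. The Gaussian-perturbation construction, the name `sample_model`, and the reduction to a single unknown mean are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def sample_model(seed: int, rewards: list[float],
                 prior_var: float = 100.0, noise_var: float = 1.0) -> float:
    """Deterministically map (shared data, agent seed) to a sampled mean reward.

    Illustrative construction: the seed fixes a prior draw and a stream of
    observation perturbations; combining them in a conjugate-Gaussian update
    yields a value distributed like a posterior sample, while remaining a
    deterministic function of the data given the seed.
    """
    rng = np.random.default_rng(seed)                  # agent-specific seed, drawn once
    theta_tilde = rng.normal(0.0, np.sqrt(prior_var))  # prior sample tied to the seed
    w = rng.normal(0.0, np.sqrt(noise_var), size=len(rewards))  # per-observation perturbations
    precision = 1.0 / prior_var + len(rewards) / noise_var
    return float((theta_tilde / prior_var
                  + np.sum(np.asarray(rewards) + w) / noise_var) / precision)
```

Because `np.random.default_rng(seed)` produces the same stream on every call, recomputing the sample as shared data grows reuses the same prior draw and perturbation prefix, so an agent's model adapts to new observations without being re-randomized; this is the kind of coherence the paper argues plain Thompson resampling lacks.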
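For context on the generated environments, sampling an Erdős-Rényi graph takes only a few lines; the vertex count and edge probability below are placeholders, since the paper's exact generation settings for the 'Maximum Reward Path' problem are not quoted here.

```python
import random

def erdos_renyi(n: int, p: float, seed: int | None = None) -> list[tuple[int, int]]:
    """Sample G(n, p): include each of the n*(n-1)/2 undirected edges
    independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

edges = erdos_renyi(20, 0.1, seed=0)  # example: 20 vertices, edge probability 0.1
```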
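The quoted 'Parallel Chains' setup is concrete enough to reproduce directly. A minimal sketch follows, assuming the second argument of N(·, ·) denotes a variance (the quote leaves the parameterization implicit) and that an agent observes a single reward when it finishes a chain:

```python
import numpy as np

C, H = 10, 5                     # chains and horizon from the quoted setup
rng = np.random.default_rng(0)

# Unknown terminal-edge reward of each chain: theta_c ~ N(0, 100 + c).
theta = rng.normal(0.0, np.sqrt(100.0 + np.arange(1, C + 1)))

def observe_reward(c: int) -> float:
    """Noisy reward when the last edge of chain c (1-indexed) is traversed:
    r_c | theta_c ~ N(theta_c, 1)."""
    return float(rng.normal(theta[c - 1], 1.0))
```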