Coordinated Exploration in Concurrent Reinforcement Learning
Authors: Maria Dimakopoulou, Benjamin Van Roy
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present computational results that demonstrate the robustness of seed sampling algorithms of Section 3.2 versus the baseline algorithms of Section 3.1. In sections 4.1 and 4.2, we present two simple problems that highlight the weaknesses of concurrent UCRL and Thompson resampling and demonstrate how severely performance may suffer due to violation of any among Properties 1, 2, 3. In Section 4.3, we demonstrate the relative efficiency of seed sampling in a more complex problem. |
| Researcher Affiliation | Academia | Maria Dimakopoulou¹, Benjamin Van Roy¹. ¹Stanford University, California, USA. Correspondence to: Maria Dimakopoulou <madima@stanford.edu>, Benjamin Van Roy <bvr@stanford.edu>. |
| Pseudocode | No | The paper describes the algorithms and their mathematical formulations in prose, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a demo video (https://youtu.be/xjGK-wm0PkI) but does not provide concrete access to the source code for the methodology described. |
| Open Datasets | No | The paper defines problem specifications like 'Bipolar Chain', 'Parallel Chains', and 'Maximum Reward Path' and describes how data for these scenarios is generated (e.g., 'we sample Erdős-Rényi graphs'), rather than referencing established public datasets with access information. |
| Dataset Splits | No | The paper describes simulated environments and experiments but does not explicitly provide details about train, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers. |
| Experiment Setup | Yes | Consider the specification of the problem with $C = 10$ chains, horizon (or equivalently number of vertices in each chain) $H = 5$, $\theta_c \sim N(0, 100 + c)$, $c \in \{1, \ldots, C\}$ and likelihood of observed reward when the last edge of chain $c$ is traversed $r_c \mid \theta_c \sim N(\theta_c, 1)$. |
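
The Parallel Chains setup quoted in the last row can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the function names `sample_parallel_chains` and `observe_reward` are hypothetical, and it assumes that the second argument of N(·, ·) is a variance (so the prior standard deviation of chain c is sqrt(100 + c)). The horizon H = 5 only fixes how many steps an agent needs to reach the last edge, so it does not appear in the sampling itself.

```python
import numpy as np


def sample_parallel_chains(num_chains=10, seed=0):
    """Draw one mean reward theta_c per chain from its Gaussian prior.

    Assumption: N(0, 100 + c) is parameterised by variance, with chains
    indexed c = 1, ..., num_chains.
    """
    rng = np.random.default_rng(seed)
    chain_ids = np.arange(1, num_chains + 1)
    prior_var = 100.0 + chain_ids
    theta = rng.normal(loc=0.0, scale=np.sqrt(prior_var))
    return theta, prior_var


def observe_reward(theta, chain, rng):
    """Noisy reward seen when the last edge of `chain` (1-indexed) is traversed.

    Follows r_c | theta_c ~ N(theta_c, 1), i.e. unit observation noise.
    """
    return rng.normal(loc=theta[chain - 1], scale=1.0)


if __name__ == "__main__":
    theta, prior_var = sample_parallel_chains(num_chains=10, seed=0)
    rng = np.random.default_rng(1)
    reward = observe_reward(theta, chain=3, rng=rng)
    print("sampled chain means:", np.round(theta, 2))
    print("one reward from chain 3:", round(float(reward), 2))
```

Fixing the seed makes the sampled environment reproducible, which is useful when comparing exploration strategies on the same draw of the chain means.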