Coordinated Versus Decentralized Exploration In Multi-Agent Multi-Armed Bandits

Authors: Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, Brendan Juba

IJCAI 2017

Reproducibility assessment: each entry lists the variable, the assessed result, and the LLM response supporting that result.

Research Type: Experimental
LLM Response: "We then introduce an algorithm for the decentralized setting that uses a value-of-information based communication strategy and an exploration-exploitation strategy based on the centralized algorithm, and show experimentally that it converges rapidly to the performance of the centralized method." (Abstract); "In this section, we describe two sets of experiments we ran to compare the performance of the decentralized multi-agent MAB exploration-exploitation algorithm with VoI communication strategy that we proposed in Section 5 with several benchmarks described below." (Section 6)

Researcher Affiliation: Academia
LLM Response: "Washington University in St. Louis; University of California, Irvine"

Pseudocode: Yes
LLM Response: "Figure 1: Flow-chart showing the steps of the VoI and simple communication strategies for an arbitrary agent j at the end of an action round, as described in Section 5."

Open Source Code: No
LLM Response: "The paper does not contain any explicit statement about releasing open-source code for the methodology, nor does it provide a link to a code repository."

Open Datasets: No
LLM Response: "The paper uses synthetic data generated based on specified parameters (Gaussian rewards, arithmetic-sequence means) but does not provide access information (link, DOI, citation) to a public or open dataset."

Dataset Splits: No
LLM Response: "The paper describes a simulation environment and experiments, but it does not specify training, validation, or test dataset splits (e.g., percentages or sample counts) as would be common for machine learning datasets."

Hardware Specification: No
LLM Response: "The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for conducting the experiments."

Software Dependencies: No
LLM Response: "The paper does not specify any ancillary software dependencies (e.g., libraries, frameworks, solvers) along with their version numbers."

Experiment Setup: Yes
LLM Response: "For each experiment, the number of agents is set at m = 25n, n being the number of arms. The means of the Gaussian reward distributions on the bandit arms form a decreasing arithmetic sequence starting at µ_max = µ_1 = 1 and ending at µ_min = µ_n = 0.05, so that the magnitude of the common difference is Θ(1/n); the shared standard deviation σ = 0.1 is independent of the number of arms." (Section 6); "Each data-point is generated by averaging the regret values over N_sim = 10^5 repetitions. We set δ = 0.01/N_sim, ε_voi = 0.05 to ensure that our confidence bounds hold for all experiments." (Section 6)

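To make the quoted setup concrete, the following is a minimal sketch (in Python) of the synthetic Gaussian bandit environment those parameters describe. The names (`make_bandit`, `pull`, `m_agents`) and the use of NumPy are illustrative assumptions, not taken from the paper; the arm means are an arithmetic sequence from 1 down to 0.05, so the common difference has magnitude (1 - 0.05)/(n - 1), i.e., Θ(1/n).

```python
import numpy as np

def make_bandit(n_arms, mu_max=1.0, mu_min=0.05, sigma=0.1, seed=None):
    """Synthetic Gaussian bandit matching the quoted setup.

    Arm means form a decreasing arithmetic sequence from mu_max to mu_min;
    all arms share the standard deviation sigma. Names are illustrative.
    """
    rng = np.random.default_rng(seed)
    means = np.linspace(mu_max, mu_min, n_arms)  # common difference is Theta(1/n)

    def pull(arm):
        # One noisy reward draw from the chosen arm.
        return rng.normal(means[arm], sigma)

    return means, pull

# Illustrative usage: n arms, m = 25n agents, regret averaged over repetitions
# (the repetition count is reduced here; the paper quotes 10^5).
n_arms = 10
m_agents = 25 * n_arms
n_sim = 100
means, pull = make_bandit(n_arms, seed=0)
print(f"best mean = {means[0]:.2f}, sample reward from arm 0: {pull(0):.3f}")
```

This sketch only instantiates the reward environment; the coordinated and decentralized exploration strategies compared in the paper would be built on top of it.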