Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits

Authors: Guojun Xiong, Jian Li

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical Evaluations. Experiments on Constructed Instance. We consider N = 6 players and K = 100 arms with rewards drawn from Gaussian distributions with mean µ_k = 0.06(101 - k) and standard deviation σ_k = 0.01(101 - k). Each player has three neighbor players in the communication graph G. At each time, we randomly assign 25 arms to each player, with neighbor players possibly sharing some arms. All regret and MSE values are averaged over 40 independent runs. Figure 1 compares the mean squared error (MSE) between each arm's true mean reward and its estimated mean reward under our proposed algorithms over a time horizon of T = 10^4 rounds. It is clear that sharing estimated rewards with neighbor players, as in MPMAB-WA-UCB, substantially improves exploration efficiency compared to only sharing local walking arm sets, as in MPMAB-WA-UCB-NR. This advantage yields lower regret, as shown in Figure 2, which is consistent with our theoretical performance guarantees. Finally, we observe that communication significantly improves performance, since communication is required to determine the optimal matching and ranking that avoid collisions. Its importance is especially pronounced when players only have access to a dynamic local walking arm set, as considered in this paper. Experiments on Wireless Downlink Scheduling. We further consider a wireless downlink scheduling problem (Li 2021; Li, Liu, and Ji 2019) that fits into our MPMAB-WA model; see (Xiong and Li 2022) for details. There are N = 6 base stations (BSs) and K = 10 walking users. Each BS covers a geographical region, and each user moves across the whole region uniformly at random, i.e., each user moves into the region covered by BS n with probability 1/N at each time slot. BSs are connected via a ring, i.e., each BS has two neighbors. The rewards of serving users in each slot (Huang, Hu, and Pan 2021) are drawn i.i.d. from Bernoulli distributions with mean rewards 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5. All MSE and regret values reported in Figures 3 and 4 are averaged over 40 independent runs, from which we draw the same conclusions as above. (A minimal simulation sketch of the constructed instance is given after this table.)
Researcher Affiliation | Academia | Guojun Xiong, Jian Li, SUNY-Binghamton University, {gxiong1,lij}@binghamton.edu
Pseudocode | Yes | Algorithm 1: MPMAB-WA-UCB for player i at time t [...]; Algorithm 2: Learn2Match for player i at time t [...]; Algorithm 3: Learn2Rank for player i at time t
Open Source Code | No | The paper does not provide a link or explicit statement about the availability of open-source code for the described methodology.
Open Datasets | Yes | Experiments on Wireless Downlink Scheduling. We further consider a wireless downlink scheduling problem (Li 2021; Li, Liu, and Ji 2019) that fits into our MPMAB-WA model; see (Xiong and Li 2022) for details. There are N = 6 base stations (BSs) and K = 10 walking users. Each BS covers a geographical region, and each user moves across the whole region uniformly at random, i.e., each user moves into the region covered by BS n with probability 1/N at each time slot. BSs are connected via a ring, i.e., each BS has two neighbors. The rewards of serving users in each slot (Huang, Hu, and Pan 2021) are drawn i.i.d. from Bernoulli distributions with mean rewards 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5. (A simulation sketch of this downlink environment also appears after this table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It describes the constructed instance and the wireless downlink scheduling problem, but not how these were split for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | No | The paper does not provide specific experimental setup details such as hyperparameter values or training configurations. It reports the number of players, arms, and neighbors and the number of independent runs, but not the parameters of the algorithm itself.
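
For readers who want to reproduce the constructed instance described in the Research Type row, the following is a minimal NumPy sketch built only from the quantities quoted there: N = 6 players, K = 100 Gaussian arms with µ_k = 0.06(101 - k) and σ_k = 0.01(101 - k) (assuming the means decrease linearly in the arm index k), 25-arm walking sets per player, and T = 10^4 rounds. The uniform-random pull is a placeholder, not the paper's MPMAB-WA-UCB policy.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 6, 100                                # players and arms, as in the excerpt
arms = np.arange(1, K + 1)
mu = 0.06 * (101 - arms)                     # assumed reading of "0.06(101 - k)"
sigma = 0.01 * (101 - arms)                  # assumed reading of "0.01(101 - k)"

T = 10_000                                   # horizon T = 10^4 rounds
ARMS_PER_PLAYER = 25                         # size of each player's walking arm set

for t in range(T):
    # Each round, every player receives a random local walking arm set;
    # neighboring players may share some arms.
    walking_sets = [rng.choice(K, size=ARMS_PER_PLAYER, replace=False)
                    for _ in range(N)]
    # Placeholder policy: each player pulls a uniformly random arm from its
    # walking set. The paper's MPMAB-WA-UCB instead coordinates pulls via
    # Learn2Match/Learn2Rank, which is not reproduced here.
    pulls = np.array([rng.choice(s) for s in walking_sets])
    rewards = rng.normal(mu[pulls], sigma[pulls])   # one Gaussian reward per player
```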
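
The wireless downlink scheduling environment can be emulated the same way. The sketch below assumes only what the excerpt states: 6 BSs connected in a ring, 10 users relocating uniformly at random each slot, and Bernoulli service rewards with the listed means. The random choice of which local user to serve is a stand-in for the learned matching/ranking policy, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

N_BS, K_USERS = 6, 10                         # base stations and walking users
p = np.array([0.95, 0.90, 0.85, 0.80, 0.75,   # Bernoulli mean reward of each user
              0.70, 0.65, 0.60, 0.55, 0.50])

# Ring topology: each BS exchanges information with its two neighbors.
neighbors = {n: [(n - 1) % N_BS, (n + 1) % N_BS] for n in range(N_BS)}

def step():
    """One slot: every user moves to a uniformly random BS region, then each BS
    serves one user currently in its region and observes a Bernoulli reward.
    The random choice of user is a stand-in for the learned scheduling policy."""
    location = rng.integers(N_BS, size=K_USERS)        # user -> covering BS
    rewards = np.zeros(N_BS)
    for n in range(N_BS):
        local_users = np.flatnonzero(location == n)    # BS n's local walking arm set
        if local_users.size:
            served = rng.choice(local_users)
            rewards[n] = rng.binomial(1, p[served])
    return rewards

total_reward = sum(step().sum() for _ in range(10_000))    # 10^4 slots, one run
```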