Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits

Authors: Huasen Wu, Xueying Guo, Xin Liu

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose and study opportunistic bandits, a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.
Researcher Affiliation | Collaboration | ¹Twitter Inc., San Francisco, California, USA; ²University of California, Davis, California, USA.
Pseudocode | Yes | Algorithm 1: AdaUCB (a runnable sketch of the algorithm appears after this table).
Open Source Code | No | The paper does not provide any specific link or statement about releasing its own source code for the methodology described.
Open Datasets | Yes | We use experiment data from Speedometer (https://storage.cloud.google.com/speedometer) and another anonymous operator to conduct the evaluation.
Dataset Splits | No | The paper mentions using synthetic and real-world traces for evaluation but does not specify any training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers required for reproducibility.
Experiment Setup | Yes | In both AdaUCB and UCB(α), we set α = 0.51, which is close to 1/2 and performs better than a larger α. We consider a 5-armed bandit with Bernoulli rewards, where the expected reward vector is [0.05, 0.1, 0.15, 0.2, 0.25]. ... we choose ℓ(−) and ℓ(+) such that P{L_t ≤ ℓ(−)} = P{L_t ≥ ℓ(+)} = 0.05. (A simulation sketch of this setup appears after the table.)
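
To make the adaptive index concrete: the abstract and Algorithm 1 describe AdaUCB as balancing exploration against the current load, exploring more when the load is low. Below is a minimal Python sketch under our reading that the standard UCB exploration bonus is scaled by a normalized, inverted load factor that is 1 at or below ℓ(−) and 0 at or above ℓ(+); the exact index and tie-breaking should be checked against Algorithm 1 in the paper. The function names `normalized_inverted_load` and `ada_ucb_arm` are ours, for illustration only.

```python
import numpy as np

def normalized_inverted_load(load, l_minus, l_plus):
    """Map the raw load to [0, 1]: 1 at/below l_minus (explore freely),
    0 at/above l_plus (pure exploitation), linear in between.
    This normalization is our reading of the paper, not a verbatim copy."""
    return float(np.clip((l_plus - load) / (l_plus - l_minus), 0.0, 1.0))

def ada_ucb_arm(counts, means, t, load, l_minus, l_plus, alpha=0.51):
    """Select an arm via a load-scaled UCB index (hedged AdaUCB-style sketch).

    counts: number of pulls per arm so far
    means:  empirical mean reward per arm
    alpha:  exploration coefficient (the paper's experiments use 0.51)
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    # Pull every arm once before trusting the index.
    if (counts == 0).any():
        return int(np.argmax(counts == 0))
    lt = normalized_inverted_load(load, l_minus, l_plus)
    bonus = np.sqrt(alpha * lt * np.log(t) / counts)
    return int(np.argmax(means + bonus))
```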
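
The experiment-setup row can likewise be exercised end to end. The sketch below reuses `ada_ucb_arm` from the previous block and adds two assumptions of ours: loads are i.i.d. Uniform(0, 1) (the paper also evaluates other load processes and real traces), and regret is weighted by the load, in keeping with the opportunistic-bandit setting. Under the uniform load model, the 5% and 95% quantiles give P{L_t ≤ ℓ(−)} = P{L_t ≥ ℓ(+)} = 0.05, matching the thresholds quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.05, 0.1, 0.15, 0.2, 0.25])  # Bernoulli arms, from the paper
T, K = 100_000, len(true_means)
# Hypothetical load model: i.i.d. Uniform(0, 1). The 5%/95% quantiles give
# P{L_t <= l_minus} = P{L_t >= l_plus} = 0.05, as in the quoted setup.
l_minus, l_plus = 0.05, 0.95

counts = np.zeros(K)
sums = np.zeros(K)
regret = 0.0
for t in range(1, T + 1):
    load = rng.uniform()
    means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
    arm = ada_ucb_arm(counts, means, t, load, l_minus, l_plus, alpha=0.51)
    reward = rng.binomial(1, true_means[arm])
    counts[arm] += 1
    sums[arm] += reward
    # Load-weighted regret, reflecting the opportunistic-bandit cost model.
    regret += load * (true_means.max() - true_means[arm])
print(f"cumulative load-weighted regret after {T} rounds: {regret:.1f}")
```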