Online (Multinomial) Logistic Bandit: Improved Regret and Constant Computation Cost

Authors: Yu-Jie Zhang, Masashi Sugiyama

NeurIPS 2023

Reproducibility Assessment (each entry lists the variable, the assessed result, and the LLM response)
Research Type — Experimental
From Section 5 ("Experiments"): This section validates the statistical and computational efficiency of the proposed method by experiments. We run bandit learning for T = 3000 iterations. In each setting, we run experiments on 6 random configurations, in which the arm set and the underlying parameter are randomly sampled. Specifically, |X| = 20 actions are randomly sampled from a 2-dimensional sphere of radius 1. In the binary case, the norm of the unknown parameter w is set to S = 5. In the multinomial case, we set K = 4 with S = 1. The reward vector is set to ρ = [0.25, 0.5, 0.75, 1]. For each configuration, we report results averaged over 10 trials.
Experimental Results: Figure 4 compares performance and computation cost in the binary case. The algorithm OL2M [7], which has a constant computational cost per iteration, serves as the baseline. Our algorithm exhibits a per-iteration time complexity similar to OL2M, confirming its computational efficiency. Compared with the state-of-the-art binary logistic bandit algorithm ada-OFU-ECOLog [10], our OFU-MLogB method incurs lighter computational overhead while preserving similar empirical performance. Figure 4 also illustrates the comparison in the multinomial setting: our algorithm runs about 50 times faster than MNL-UCB [11] over T = 3000 iterations and achieves better empirical performance.
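For concreteness, the synthetic configuration described above can be generated along the following lines. This is a minimal sketch under the stated parameters (|X| = 20 arms on the unit sphere in R^2, K = 4 outcomes, S = 1, ρ = [0.25, 0.5, 0.75, 1]); the exact sampling distributions and the use of the Frobenius norm for the parameter bound are assumptions, since the report only says arms and parameter are "randomly sampled".

```python
import numpy as np

rng = np.random.default_rng(0)

# One random configuration of the multinomial experiment.
# Assumed details (not from the paper): Gaussian directions for the
# arms and Frobenius-norm scaling of the parameter matrix.
d, n_arms, K, S = 2, 20, 4, 1.0
rho = np.array([0.25, 0.5, 0.75, 1.0])   # reward vector

arms = rng.normal(size=(n_arms, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)  # points on the unit sphere

W_star = rng.normal(size=(K, d))
W_star *= S / np.linalg.norm(W_star)                 # enforce ||W*|| = S
```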
Researcher Affiliation — Academia
Yu-Jie Zhang^1, Masashi Sugiyama^{2,1}. ^1 The University of Tokyo, Chiba, Japan; ^2 RIKEN AIP, Tokyo, Japan.
Pseudocode — Yes

Algorithm 1: MNL-UCB+
Input: regularization coefficient λ, probability δ.
1: Initialize H_1 = λ I_{Kd} and W_1^{MLE} as any point in W.
2: for t = 1, ..., T do
3:   Construct r̃_t and select the arm by (5); the learner then receives y_t.
4:   Train the estimator W_{t+1}^{MLE} by (3) and construct the confidence set C_{t+1}(δ) as in (4).
5: end for
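The equations (3)-(5) referenced in Algorithm 1 are not reproduced in this report, so the following Python sketch only shows the shape of the optimism-in-the-face-of-uncertainty loop. The helpers fit_mle, ucb_bonus, and sample_feedback are hypothetical placeholders standing in for the paper's equations (3)-(5) and the environment, not the paper's actual formulas.

```python
import numpy as np

def expected_reward(W, x, rho):
    """Plug-in expected reward of arm x under multinomial-logit parameter W."""
    logits = np.concatenate(([0.0], W @ x))  # outcome 0 is the null outcome
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[1:] @ rho                       # reward-weighted outcome probabilities

def mnl_ucb_plus(arms, rho, T, lam, fit_mle, ucb_bonus, sample_feedback):
    """Skeleton of Algorithm 1 (MNL-UCB+); fit_mle/ucb_bonus are placeholders."""
    K, d = len(rho), arms.shape[1]
    H = lam * np.eye(K * d)                  # step 1: H_1 = lam * I_{Kd}
    W = np.zeros((K, d))                     # step 1: W_1^MLE, any point in W
    history = []
    for t in range(T):                       # step 2
        # Step 3: optimistic index r~_t = plug-in reward + exploration bonus.
        scores = [expected_reward(W, x, rho) + ucb_bonus(H, x) for x in arms]
        x_t = arms[int(np.argmax(scores))]
        y_t = sample_feedback(x_t)           # learner receives feedback y_t
        history.append((x_t, y_t))
        # Step 4: refit the MLE and update the matrix defining C_{t+1}(delta).
        W, H = fit_mle(history, lam)
    return W
```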
Open Source Code — No
Information is insufficient. The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets — No
Evidence: "We run bandit learning for T = 3000 iterations. In each setting, we run experiments on 6 random configurations, in which the arm set and the underlying parameter are randomly sampled. Specifically, |X| = 20 actions are randomly sampled from a 2-dimensional sphere of radius 1." Information is insufficient: the paper describes a synthetic data-generation process rather than using, or providing access information for, a publicly available dataset.
Dataset Splits — No
Evidence: "We run bandit learning for T = 3000 iterations." Information is insufficient. The paper studies a bandit learning problem, which does not involve fixed train/validation/test splits in the way supervised learning tasks do; evaluation is based on cumulative regret over iterations.
Hardware Specification — Yes
The experiments are run on Xeon E-2288G processors (8 cores, 3.7 GHz base, 5.0 GHz boost).
Software Dependencies — No
Information is insufficient. The paper does not list the software dependencies or version numbers used for the experiments.
Experiment Setup — Yes
The parameters of all contenders are set according to the orders suggested in their corresponding papers. We use λ = d log(T) for Log-UCB1, λ = d for OL2M, λ = 1 for ada-OFU-ECOLog, and λ = KαS for OFU-MLogB. The step size for OFU-MLogB is set as η = S/2 + ln(K + 1)/4.
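The stated choices translate directly into code; the snippet below is only a transcription of those formulas, with d, T, K, S taken from the experiment description and alpha left as a placeholder because the constant α is not defined in this report.

```python
import math

d, T, K, S = 2, 3000, 4, 1.0
alpha = 1.0  # placeholder: the paper's constant α is not reproduced here

lam = {
    "Log-UCB1":       d * math.log(T),
    "OL2M":           d,
    "ada-OFU-ECOLog": 1,
    "OFU-MLogB":      K * alpha * S,
}
eta = S / 2 + math.log(K + 1) / 4  # step size for OFU-MLogB
```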