Online (Multinomial) Logistic Bandit: Improved Regret and Constant Computation Cost
Authors: Yu-Jie Zhang, Masashi Sugiyama
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): This section validates the statistical and computational efficiency of the proposed method by experiments. We conduct bandit learning for T = 3000 iterations. In each experiment, we run experiments on 6 random configurations, in which the arm set and the underlying parameter are randomly sampled. Specifically, |X| = 20 actions are randomly sampled from a 2-dimensional sphere of radius 1. In the binary case, the norm of the unknown parameter w is set as S = 5. In the multinomial case, we set K = 4 with S = 1. The reward vector is set as ρ = [0.25, 0.5, 0.75, 1]. For each configuration, we report the averaged results over 10 trials. Experimental Results: Figure 4 provides a comparison of performance and computation costs in the binary case. The algorithm OL2M [7], which has a constant computational cost per iteration, is used as a baseline. Our algorithm demonstrates a per-iteration cost akin to OL2M, affirming its computational efficiency. Compared to the state-of-the-art binary logistic bandit algorithm ada-OFU-ECOLog [10], our OFU-MLogB method has lighter computational overhead while preserving similar empirical performance. Figure 4 also illustrates the comparison in the multinomial setting: our algorithm is around 50 times faster than MNL-UCB [11] over T = 3000 iterations and achieves better empirical performance. |
| Researcher Affiliation | Academia | Yu-Jie Zhang1 Masashi Sugiyama2,1 1 The University of Tokyo, Chiba, Japan 2 RIKEN AIP, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 MNL-UCB+. Input: regularization coefficient λ, probability δ. 1: Initialize H_1 = λI_{Kd} and W^MLE_1 as any point in W. 2: for t = 1, ..., T do 3: Construct r̃_t and select the arm by (5); the learner then receives y_t. 4: Train the estimator W^MLE_{t+1} by (3) and construct the confidence set C_{t+1}(δ) as (4). 5: end for |
| Open Source Code | No | Information is insufficient. The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | No | We conduct bandit learning for T = 3000 iterations. In each experiment, we run experiments on 6 random configurations, in which the arm set and the underlying parameter are randomly sampled. Specifically, |X| = 20 actions are randomly sampled from a 2-dimensional sphere of radius 1. Information is insufficient. The paper describes a synthetic data generation process rather than using or providing access information for a publicly available dataset. |
| Dataset Splits | No | We conduct a bandit learning for T = 3000 iterations. Information is insufficient. The paper describes a bandit learning problem, which does not typically involve fixed train/validation/test dataset splits in the same way as supervised learning tasks. Evaluation is based on cumulative regret over iterations. |
| Hardware Specification | Yes | The experiments are run on Xeon E-2288G processors (8 cores, 3.7GHz base, 5.0GHz boost). |
| Software Dependencies | No | Information is insufficient. The paper does not provide specific software dependencies with version numbers used for the experiments. |
| Experiment Setup | Yes | The parameters of all contenders are set according to their orders as suggested in the corresponding papers. We use λ = d log(T) for Log-UCB1, λ = d for OL2M, λ = 1 for ada-OFU-ECOLog, and λ = KαS for OFU-MLogB. The step size for OFU-MLogB is set as η = S/2 + ln(K + 1)/4. |
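The synthetic setup quoted above (|X| = 20 arms on a 2-dimensional unit sphere, K = 4 non-baseline outcomes, parameter norm S, reward vector ρ = [0.25, 0.5, 0.75, 1]) can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: the helper names (`sample_sphere`, `make_config`, `expected_reward`), the multinomial-logit form with an implicit baseline class, and the even split of the norm budget S across the K parameter rows are all assumptions filled in for illustration.

```python
import math
import random

def sample_sphere(d, n, rng):
    """Sample n points uniformly from the unit sphere in R^d
    (normalized Gaussian vectors)."""
    pts = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        pts.append([x / norm for x in v])
    return pts

def softmax_probs(logits):
    """Multinomial-logit probabilities of the K non-baseline outcomes;
    the baseline class (logit 0) absorbs the remaining mass."""
    m = max(logits + [0.0])
    exps = [math.exp(z - m) for z in logits]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]

def make_config(d=2, n_arms=20, K=4, S=1.0, seed=0):
    """One random configuration: arm set plus a hidden parameter W
    with Frobenius norm S (assumption: norm split evenly over rows)."""
    rng = random.Random(seed)
    arms = sample_sphere(d, n_arms, rng)
    scale = S / math.sqrt(K)
    W = [[scale * x for x in row] for row in sample_sphere(d, K, rng)]
    return arms, W

def expected_reward(x, W, rho):
    """Expected reward of arm x: rho weighted by outcome probabilities."""
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    p = softmax_probs(logits)
    return sum(r * pi for r, pi in zip(rho, p))
```

With ρ bounded by 1 and the probabilities summing to at most 1, `expected_reward` always lands in [0, 1], which matches the bounded-reward assumption standard in this line of work.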
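The report notes that evaluation is based on cumulative regret over T = 3000 iterations, averaged over trials. A minimal sketch of that protocol follows; since the paper's update rules (equations (3)-(5)) are not reproduced in this report, the policy here is a uniform-random placeholder, and the function names (`run_trial`, `random_policy_factory`) are hypothetical.

```python
import random

def run_trial(arms, reward_fn, policy, T=3000):
    """Cumulative pseudo-regret over T rounds. `policy(t, history)`
    returns an arm index; the optimum is known here because the
    environment is synthetic."""
    means = [reward_fn(x) for x in arms]
    best = max(means)
    history, regret, cum = [], [], 0.0
    for t in range(T):
        a = policy(t, history)
        cum += best - means[a]
        history.append((a, means[a]))
        regret.append(cum)
    return regret

def random_policy_factory(n_arms, seed=0):
    """Placeholder policy: any bandit algorithm (e.g. the paper's
    OFU-MLogB) would slot in via the same interface."""
    rng = random.Random(seed)
    return lambda t, history: rng.randrange(n_arms)
```

Averaging `run_trial` over the report's 6 random configurations and 10 trials per configuration reproduces the shape of the regret curves being compared, even though the competing algorithms themselves are not included here.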