Multi-Objective Generalized Linear Bandits
Authors: Shiyin Lu, Guanghui Wang, Yao Hu, Lijun Zhang
IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments demonstrate the effectiveness of our method. In this section, we conduct numerical experiments to compare our algorithm with the following multi-objective bandit algorithms. As can be seen from Fig. 1, where the vertical axis represents the cumulative Pareto regret up to round t, our algorithm significantly outperforms its competitors in all experiments. |
| Researcher Affiliation | Collaboration | Shiyin Lu¹, Guanghui Wang¹, Yao Hu² and Lijun Zhang¹. ¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; ²Youku Cognitive and Intelligent Lab, Alibaba Group, Beijing 100102, China. {lusy, wanggh, zhanglj}@lamda.nju.edu.cn, yaoohu@alibaba-inc.com |
| Pseudocode | Yes | Algorithm 1 MOGLB-UCB |
| Open Source Code | No | The paper does not provide a direct link to open-source code or explicitly state that the code is publicly available. |
| Open Datasets | No | We use a synthetic dataset constructed as follows. For each objective i ∈ [m], we sample the coefficients θ_i uniformly from the positive part of the unit ball. To control the size of the Pareto front, we generate an arm set comprising 4d arms as follows. We first draw 3d arms uniformly from the centered ball of radius 0.5, and then sample d arms uniformly from the centered unit ball. We repeat this process until the size of the Pareto front is not more than d. (A generation sketch is given below the table.) |
| Dataset Splits | No | The paper uses a synthetic dataset but does not specify explicit training, validation, or test splits, nor does it mention cross-validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions two models (probit model and logit model) used for generating reward components, but does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | In our algorithm, there is a parameter λ. Since its only role is to make Z_t invertible and our algorithm is insensitive to it, we simply set λ = max(1, κ/2). Following common practice in bandit learning [Zhang et al., 2016; Jun et al., 2017], we also tune the width of the confidence set γ_t as c · log(det(Z_t) / det(Z_1)), where c is searched within [1e-3, 1]. We let m = 5 and pick d from {5, 10, 15}. For each objective i ∈ [m], we sample the coefficients θ_i uniformly from the positive part of the unit ball. We perform 10 trials up to round T = 3000 and report the average performance of the algorithms. (A parameter sketch is given below the table.) |
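
The synthetic-data recipe quoted in the Open Datasets row can be made concrete with a short script. The following is a minimal sketch assuming NumPy: `sample_ball`, `pareto_front`, and `generate_arm_set` are hypothetical helper names, the `max_tries` cap on the rejection loop is an added safeguard not mentioned in the paper, and the Pareto front is computed from the linear scores θ_iᵀx, which gives the same front as the transformed rewards because the probit/logit links are monotone.

```python
import numpy as np

def sample_ball(n, d, radius=1.0, rng=None):
    """Draw n points uniformly from the d-dimensional centered ball."""
    rng = np.random.default_rng() if rng is None else rng
    directions = rng.standard_normal((n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return directions * radii

def pareto_front(scores):
    """Indices of arms not dominated in every objective (scores: num_arms x m)."""
    front = []
    for a in range(scores.shape[0]):
        dominated = np.any(
            np.all(scores >= scores[a], axis=1) & np.any(scores > scores[a], axis=1)
        )
        if not dominated:
            front.append(a)
    return front

def generate_arm_set(theta, d, rng=None, max_tries=1000):
    """Redraw 4d arms (3d from the radius-0.5 ball, d from the unit ball)
    until the Pareto front has at most d arms, as described in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        arms = np.vstack([
            sample_ball(3 * d, d, radius=0.5, rng=rng),
            sample_ball(d, d, radius=1.0, rng=rng),
        ])
        # Monotone link functions preserve the ordering of theta_i^T x,
        # so the Pareto front can be computed on the linear scores directly.
        if len(pareto_front(arms @ theta.T)) <= d:
            return arms
    raise RuntimeError("no arm set with a small enough Pareto front was found")

rng = np.random.default_rng(0)
m, d = 5, 5                                   # m = 5 objectives, d picked from {5, 10, 15}
theta = np.abs(sample_ball(m, d, rng=rng))    # uniform on the positive part of the unit ball
arms = generate_arm_set(theta, d, rng=rng)    # 4d candidate arms
```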
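Similarly, the parameter choices quoted in the Experiment Setup row can be sketched as below. This is not the paper's Algorithm 1 (MOGLB-UCB); the value of κ, the grid over c, and the rank-one update of the design matrix Z_t are illustrative assumptions.

```python
import numpy as np

d = 5                      # dimension (assumed, matching the setup above)
kappa = 0.25               # assumed lower bound on the link-function derivative

# lambda only needs to make Z_t invertible, so the paper sets it to max(1, kappa/2).
lam = max(1.0, kappa / 2.0)

Z1 = lam * np.eye(d)       # Z_1 = lambda * I
Zt = Z1.copy()

def confidence_width(Zt, Z1, c):
    """gamma_t = c * log(det(Z_t) / det(Z_1)), with c grid-searched within [1e-3, 1]."""
    _, logdet_t = np.linalg.slogdet(Zt)
    _, logdet_1 = np.linalg.slogdet(Z1)
    return c * (logdet_t - logdet_1)

# Assumed discretization of the search range [1e-3, 1] for c.
c_grid = np.logspace(-3, 0, 7)

# Assumed standard design-matrix update after pulling arm x_t: Z_{t+1} = Z_t + x_t x_t^T.
x_t = np.random.default_rng(1).standard_normal(d)
x_t /= np.linalg.norm(x_t)
Zt = Zt + np.outer(x_t, x_t)

print({round(c, 3): round(confidence_width(Zt, Z1, c), 4) for c in c_grid})
```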