Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Authors: Joongkyu Lee, Min-hwan Oh
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also conduct numerical experiments and show that our algorithm consistently outperforms the existing MNL bandit algorithms while maintaining a constant computation cost per round. Furthermore, the empirical results corroborate our theoretical findings regarding the dependence of regret on the reward structure, v0 and K. |
| Researcher Affiliation | Academia | Joongkyu Lee Seoul National University Seoul, South Korea jklee0717@snu.ac.kr Min-hwan Oh Seoul National University Seoul, South Korea minoh@snu.ac.kr |
| Pseudocode | Yes | Algorithm 1 OFU-MNL+ |
| Open Source Code | No | We have included the code in the supplementary material. After our paper is accepted, we will provide open access to the code. |
| Open Datasets | No | For each instance, we sample the true parameter w from a uniform distribution in [-1/√d, 1/√d]^d. For the context features x_{ti}, we sample each x_{ti} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d. |
| Dataset Splits | No | The paper reports cumulative regret over 3000 rounds and average performance over 20 independent instances, but does not specify explicit training, validation, or test splits for the data. |
| Hardware Specification | Yes | The experiments are run on Xeon(R) Gold 6226R CPU @ 2.90GHz (16 cores). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For each instance, we sample the true parameter w from a uniform distribution in [-1/√d, 1/√d]^d. For the context features x_{ti}, we sample each x_{ti} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d. This setup ensures compliance with Assumption 1. In the uniform reward setting (first row of Figure 1), the combinatorial optimization step to choose the assortment reduces to sorting items by their utility estimate. In the non-uniform reward setting (second row of Figure 1), the rewards are sampled from a uniform distribution in each round, i.e., r_{ti} ~ Unif(0, 1). |
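
The Experiment Setup row describes the synthetic instance generation concretely enough to sketch it in code. The snippet below is a minimal Python/NumPy sketch of that setup, not the authors' released implementation: the horizon T = 3000 and the 20 independent instances come from the report, while the dimension d, item count N, assortment size K, and all function names are illustrative assumptions.

```python
import numpy as np

def make_instance(d=5, N=100, T=3000, uniform_rewards=True, seed=0):
    """Sketch of one synthetic MNL bandit instance as described in the
    Experiment Setup row. d and N are placeholder values, not values
    confirmed by the paper."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(d)
    # True parameter w drawn uniformly from [-1/sqrt(d), 1/sqrt(d)]^d.
    w_true = rng.uniform(-bound, bound, size=d)
    contexts, rewards = [], []
    for _ in range(T):
        # Context features x_{ti} ~ N(0_d, I_d), clipped to [-1/sqrt(d), 1/sqrt(d)]^d.
        x_t = np.clip(rng.standard_normal((N, d)), -bound, bound)
        contexts.append(x_t)
        # Uniform-reward setting: r_{ti} = 1; non-uniform: r_{ti} ~ Unif(0, 1) each round.
        r_t = np.ones(N) if uniform_rewards else rng.uniform(0.0, 1.0, size=N)
        rewards.append(r_t)
    return w_true, contexts, rewards

def top_k_assortment(x_t, w_hat, K=5):
    """In the uniform-reward setting, the assortment optimization reduces to
    sorting items by their estimated utility x_{ti}^T w_hat and taking the top K
    (K = 5 is a placeholder)."""
    return np.argsort(x_t @ w_hat)[-K:]
```

A run of the reported experiments would repeat this generation over 20 independent instances (e.g., seeds 0 through 19) and average the cumulative regret over the 3000 rounds.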