Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Authors: Joongkyu Lee, Min-hwan Oh

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also conduct numerical experiments and show that our algorithm consistently outperforms the existing MNL bandit algorithms while maintaining a constant computation cost per round. Furthermore, the empirical results corroborate our theoretical findings regarding the dependence of regret on the reward structure, v_0, and K.
Researcher Affiliation | Academia | Joongkyu Lee, Seoul National University, Seoul, South Korea (jklee0717@snu.ac.kr); Min-hwan Oh, Seoul National University, Seoul, South Korea (minoh@snu.ac.kr)
Pseudocode | Yes | Algorithm 1: OFU-MNL+
Open Source Code | No | We have included the code in the supplementary material. After our paper is accepted, we will provide open access to the code.
Open Datasets | No | For each instance, we sample the true parameter w* from a uniform distribution in [-1/√d, 1/√d]^d. For the context features x_{ti}, we sample each x_{ti} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d.
Dataset Splits | No | The paper reports cumulative regret over 3000 rounds and average performance over 20 independent instances, but does not specify explicit training, validation, or test splits for the data.
Hardware Specification | Yes | The experiments are run on a Xeon(R) Gold 6226R CPU @ 2.90GHz (16 cores).
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | For each instance, we sample the true parameter w* from a uniform distribution in [-1/√d, 1/√d]^d. For the context features x_{ti}, we sample each x_{ti} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d. This setup ensures compliance with Assumption 1. In the uniform reward setting (first row of Figure 1), the combinatorial optimization step to choose the assortment reduces to sorting items by their utility estimate. In the non-uniform reward setting (second row of Figure 1), the rewards are sampled from a uniform distribution in each round, i.e., r_{ti} ~ Unif(0, 1).
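The synthetic setup quoted in the last two rows can be illustrated concretely. Below is a minimal Python sketch, not the authors' released code: the dimension d, number of candidate items N, assortment size K, and outside-option utility v0 are illustrative placeholders. It samples the true parameter and clipped Gaussian contexts as described, and, for the uniform-reward setting, picks the assortment by sorting items by estimated utility.

```python
import numpy as np

# Illustrative sketch of the synthetic setup described above (not the authors' code).
# d, N, K, and v0 below are assumed placeholder values, not taken from the paper.
rng = np.random.default_rng(0)
d, N, K = 5, 20, 4          # feature dimension, candidate items per round, assortment size
bound = 1.0 / np.sqrt(d)

# True parameter w* drawn uniformly from [-1/sqrt(d), 1/sqrt(d)]^d.
w_star = rng.uniform(-bound, bound, size=d)

def sample_contexts():
    """Draw N context features x_{t,i} i.i.d. from N(0_d, I_d), clipped to [-1/sqrt(d), 1/sqrt(d)]^d."""
    X = rng.standard_normal((N, d))
    return np.clip(X, -bound, bound)

def choose_assortment(X, w_hat, K):
    """Uniform-reward case: the combinatorial step reduces to taking the K items
    with the largest estimated utilities x_{t,i}^T w_hat."""
    utilities = X @ w_hat
    return np.argsort(utilities)[-K:][::-1]

def mnl_choice_probs(X_S, w, v0=0.0):
    """MNL choice probabilities over an assortment, with an outside option of utility v0."""
    scores = np.exp(X_S @ w)
    denom = np.exp(v0) + scores.sum()
    return scores / denom

# One illustrative round, using w* itself as a stand-in utility estimate.
X_t = sample_contexts()
S_t = choose_assortment(X_t, w_star, K)
print("assortment:", S_t, "choice probabilities:", mnl_choice_probs(X_t[S_t], w_star))
```

In the non-uniform reward setting mentioned in the same row, the sorting step would instead be replaced by expected-revenue maximization over assortments, which is not shown in this sketch.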