Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Authors: Joongkyu Lee, Min-hwan Oh
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also conduct numerical experiments and show that our algorithm consistently outperforms the existing MNL bandit algorithms while maintaining a constant computation cost per round. Furthermore, the empirical results corroborate our theoretical findings regarding the dependence of regret on the reward structure, v0 and K. |
| Researcher Affiliation | Academia | Joongkyu Lee Seoul National University Seoul, South Korea jklee0717@snu.ac.kr Min-hwan Oh Seoul National University Seoul, South Korea minoh@snu.ac.kr |
| Pseudocode | Yes | Algorithm 1 OFU-MNL+ |
| Open Source Code | No | We have included the code in the supplementary material. After our paper is accepted, we will provide open access to the code. |
| Open Datasets | No | For each instance, we sample the true parameter w from a uniform distribution in [-1/√d, 1/√d]^d. For the context features x_{ti}, we sample each x_{ti} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d. |
| Dataset Splits | No | The paper reports cumulative regret over 3000 rounds and average performance over 20 independent instances, but does not specify explicit training, validation, or test splits for the data. |
| Hardware Specification | Yes | The experiments are run on Xeon(R) Gold 6226R CPU @ 2.90GHz (16 cores). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For each instance, we sample the true parameter w from a uniform distribution in [-1/√d, 1/√d]^d. For the context features x_{ti}, we sample each x_{ti} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d. This setup ensures compliance with Assumption 1. In the uniform reward setting (first row of Figure 1), the combinatorial optimization step to choose the assortment reduces to sorting items by their utility estimate. In the non-uniform reward setting (second row of Figure 1), the rewards are sampled from a uniform distribution in each round, i.e., r_{ti} ~ Unif(0, 1). |
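
The Experiment Setup row describes the synthetic instance generation concretely enough to sketch it in code. The snippet below is a minimal Python/NumPy sketch of that setup, not the authors' released implementation: the horizon T = 3000 and the 20 independent instances come from the report, while the dimension d, item count N, assortment size K, and all function names are illustrative assumptions.

```python
import numpy as np

def make_instance(d=5, N=100, T=3000, uniform_rewards=True, seed=0):
    """Sketch of one synthetic MNL bandit instance as described in the
    Experiment Setup row. d and N are placeholder values, not values
    confirmed by the paper."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(d)
    # True parameter w drawn uniformly from [-1/sqrt(d), 1/sqrt(d)]^d.
    w_true = rng.uniform(-bound, bound, size=d)
    contexts, rewards = [], []
    for _ in range(T):
        # Context features x_{ti} ~ N(0_d, I_d), clipped to [-1/sqrt(d), 1/sqrt(d)]^d.
        x_t = np.clip(rng.standard_normal((N, d)), -bound, bound)
        contexts.append(x_t)
        # Uniform-reward setting: r_{ti} = 1; non-uniform: r_{ti} ~ Unif(0, 1) each round.
        r_t = np.ones(N) if uniform_rewards else rng.uniform(0.0, 1.0, size=N)
        rewards.append(r_t)
    return w_true, contexts, rewards

def top_k_assortment(x_t, w_hat, K=5):
    """In the uniform-reward setting, the assortment optimization reduces to
    sorting items by their estimated utility x_{ti}^T w_hat and taking the top K
    (K = 5 is a placeholder)."""
    return np.argsort(x_t @ w_hat)[-K:]
```

A run of the reported experiments would repeat this generation over 20 independent instances (e.g., seeds 0 through 19) and average the cumulative regret over the 3000 rounds.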