Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Authors: Joongkyu Lee, Min-hwan Oh
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also conduct numerical experiments and show that our algorithm consistently outperforms the existing MNL bandit algorithms while maintaining a constant computation cost per round. Furthermore, the empirical results corroborate our theoretical findings regarding the dependence of regret on the reward structure, v0 and K. |
| Researcher Affiliation | Academia | Joongkyu Lee Seoul National University Seoul, South Korea EMAIL Min-hwan Oh Seoul National University Seoul, South Korea EMAIL |
| Pseudocode | Yes | Algorithm 1 OFU-MNL+ |
| Open Source Code | No | We have included the code in the supplementary material. After our paper is accepted, we will provide open access to the code. |
| Open Datasets | No | For each instance, we sample the true parameter w* from a uniform distribution on [-1/√d, 1/√d]^d. For the context features x_{t,i}, we sample each x_{t,i} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d. |
| Dataset Splits | No | The paper reports cumulative regret over 3000 rounds and average performance over 20 independent instances, but does not specify explicit training, validation, or test splits for the data. |
| Hardware Specification | Yes | The experiments are run on Xeon(R) Gold 6226R CPU @ 2.90GHz (16 cores). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For each instance, we sample the true parameter w* from a uniform distribution on [-1/√d, 1/√d]^d. For the context features x_{t,i}, we sample each x_{t,i} independently and identically distributed (i.i.d.) from a multivariate Gaussian distribution N(0_d, I_d) and clip it to the range [-1/√d, 1/√d]^d. This setup ensures compliance with Assumption 1. In the uniform reward setting (first row of Figure 1), the combinatorial optimization step to choose the assortment reduces to sorting items by their utility estimate. In the non-uniform reward setting (second row of Figure 1), the rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1). |
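The quoted setup can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the dimension `d`, number of items `N`, and assortment size `K` are arbitrary illustrative values, and the bracketed range [-1/√d, 1/√d]^d is reconstructed from the garbled PDF extraction of the quote.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 5, 20, 4  # dimension, items per round, assortment size (illustrative)

# True parameter w*: uniform over [-1/sqrt(d), 1/sqrt(d)]^d
bound = 1.0 / np.sqrt(d)
w_star = rng.uniform(-bound, bound, size=d)

# Context features: i.i.d. N(0_d, I_d), clipped to [-1/sqrt(d), 1/sqrt(d)]^d
X = rng.standard_normal((N, d))
X = np.clip(X, -bound, bound)

# Uniform-reward setting: the assortment-selection step reduces to
# sorting items by estimated utility x_i^T w and taking the top K
utilities = X @ w_star
assortment = np.argsort(utilities)[-K:][::-1]

# Non-uniform-reward setting: per-round rewards r_{t,i} ~ Unif(0, 1)
rewards = rng.uniform(0.0, 1.0, size=N)
```

With non-uniform rewards, top-K sorting no longer suffices and the expected-revenue maximization over assortments must be solved directly, which is why the two settings are reported separately in Figure 1.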