Bandits with Preference Feedback: A Stackelberg Game Perspective

Authors: Barna Pásztor, Parnian Kassraie, Andreas Krause

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments are on finding the maxima of test functions commonly used in (non-convex) optimization literature [Jamil and Yang, 2013], given only preference feedback. These functions cover challenging optimization landscapes including several local optima, plateaus, and valleys, allowing us to test the versatility of MAXMINLCB. We use the Ackley function for illustration in the main text and provide the regret plots for the remainder of the functions in Appendix E. For all experiments, we set the horizon T = 2000 and evaluate all algorithms on a uniform mesh over the input domain of size 100. Additionally, we conducted experiments on the Yelp restaurant review dataset to demonstrate the applicability of MAXMINLCB on real-world data and its scaling to larger domains."
Researcher Affiliation | Academia | Barna Pásztor (1,2), Parnian Kassraie (1), Andreas Krause (1,2); 1: ETH Zurich, 2: ETH AI Center; {bpasztor, pkassraie, krausea}@ethz.ch
Pseudocode | Yes | "Algorithm 1 MAXMINLCB"
Open Source Code | Yes | "The code is made available at github.com/lasgroup/MaxMinLCB."
Open Datasets | Yes | "Additionally, we conducted experiments on the Yelp restaurant review dataset to demonstrate the applicability of MAXMINLCB on real-world data and its scaling to larger domains."
Dataset Splits | No | The paper mentions evaluating algorithms on a "uniform mesh" and using the Yelp dataset but does not specify explicit training/validation/test splits or percentages.
Hardware Specification | Yes | "We ran our experiments on a shared cluster equipped with various NVIDIA GPUs and AMD EPYC CPUs. Our default configuration for all experiments was a single GPU with 24 GB of memory, 16 CPU cores, and 16 GB of RAM."
Software Dependencies | No | "The environments and algorithms are implemented end-to-end in JAX [Bradbury et al., 2018]." (The framework is named, but no versioned dependency list is provided.)
Experiment Setup | Yes | "We set δ = 0.1 for all algorithms. For GP-UCB and LGP-UCB, we set β = 1, and 0.25 for the noise variance. We use the Radial Basis Function (RBF) kernel and choose the variance and length scale parameters from [0.1, 1.0] to optimize their performance separately. For LGP-UCB, we tuned λ, the L2 penalty coefficient in Proposition 1, on the grid [0.0, 0.1, 1.0, 5.0] and B on [1.0, 2.0, 3.0]. The hyper-parameter selections were done for each algorithm separately to create a fair comparison."
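The Research Type row above describes the evaluation protocol: test functions such as Ackley, queried only through preference feedback, on a uniform mesh of 100 candidate points. The following is a minimal sketch of that setup; the 1-D domain [-5, 5] and the logistic preference link are illustrative assumptions, not taken from the paper, which may use different domains and link functions.

```python
import math

def ackley(x):
    """1-D Ackley test function; global minimum f(0) = 0."""
    return (-20.0 * math.exp(-0.2 * abs(x))
            - math.exp(math.cos(2.0 * math.pi * x))
            + 20.0 + math.e)

# Uniform mesh of 100 candidate actions over an assumed domain [-5, 5],
# matching the "uniform mesh over the input domain of size 100" quoted above.
mesh = [-5.0 + 10.0 * i / 99 for i in range(100)]

def pref_prob(x, y):
    """Assumed logistic preference model: probability that x is preferred
    over y. Lower f is better here, so the margin is f(y) - f(x)."""
    return 1.0 / (1.0 + math.exp(-(ackley(y) - ackley(x))))
```

A learner in this setup never observes `ackley(x)` directly; at each of the T = 2000 rounds it picks a pair from `mesh` and sees a single Bernoulli draw with success probability `pref_prob(x, y)`.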
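The Experiment Setup row amounts to a small per-algorithm grid search over λ and B. A minimal sketch of that tuning loop; `evaluate` is a hypothetical placeholder, since in practice it would run LGP-UCB with the given hyper-parameters and return its cumulative regret at T = 2000.

```python
import itertools

# Grids quoted in the Experiment Setup row above.
lambda_grid = [0.0, 0.1, 1.0, 5.0]   # L2 penalty coefficient
B_grid = [1.0, 2.0, 3.0]             # norm bound

def evaluate(lam, B):
    # Hypothetical stand-in score: a quadratic bowl whose minimum lies at
    # (0.1, 2.0). A real run would return the algorithm's cumulative regret.
    return (lam - 0.1) ** 2 + (B - 2.0) ** 2

# Exhaustive search over all 12 configurations; keep the lowest score.
best = min(itertools.product(lambda_grid, B_grid),
           key=lambda cfg: evaluate(*cfg))
# best == (0.1, 2.0) for this placeholder score
```

Because the grids are tiny (4 × 3 = 12 configurations), exhaustive search is cheap; the paper's protocol repeats this independently per algorithm so that no method is advantaged by shared tuning.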