Preference-based Pure Exploration

Authors: Apurv Shukla, Debabrota Basu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We study the preference-based pure exploration problem for bandits with vectorvalued rewards. The rewards are ordered using a (given) preference cone C and our goal is to identify the set of Pareto optimal arms. First, to quantify the impact of preferences, we derive a novel lower bound on sample complexity for identifying the most preferred policy with a confidence level 1 δ. Our lower bound elicits the role played by the geometry of the preference cone and punctuates the difference in hardness compared to existing best-arm identification variants of the problem. We further explicate this geometry when the rewards follow Gaussian distributions. We then provide a convex relaxation of the lower bound and leverage it to design the Preference-based Track and Stop (Pre TS) algorithm that identifies the most preferred policy. Finally, we show that the sample complexity of Pre TS is asymptotically tight by deriving a new concentration inequality for vector-valued rewards. ... As future work, it would be interesting to verify our results on a real-world datasets.
Researcher Affiliation Academia Apurv Shukla Department of ECE, Texas A&M University College Station, TX 77840 apurv.shukla@umich.edu Debabrota Basu Equipe School, Univ. Lille, Inria, CNRS Centrale Lille, UMR-9189 CRISt AL, France debabrota.basu@inria.fr
Pseudocode Yes Algorithm 1 Preference-based Track-and-Stop (Pre TS) 1: Input: Confidence parameter δ, preference cone C 2: if supπ t min M ch Λπ t ( ˆ Mt) minz C P k Nk,td KL z ˆ M (k) t , z M (k) β(t, δ) then 3: Compute wt arg maxw K VC(w, ˆ Mt) 4: Play kt arg mink [K] Nk,t Pt s=1 ws 5: Observe reward rt 6: Construct an empirical estimator ˆ Mt 7: end if 8: Construct a Pareto Front ˆPτ from empirical means ˆ Mτ 9: Return: ˆPτ
Open Source Code No The paper does not contain any explicit statements or links indicating the release of open-source code for the described methodology.
Open Datasets No The paper is theoretical and does not present experiments that use a specific dataset, therefore it does not mention public access information for any dataset.
Dataset Splits No The paper is theoretical and does not conduct experiments with dataset splits for training, validation, or testing.
Hardware Specification No The paper is theoretical and does not describe any hardware used for running experiments.
Software Dependencies No The paper is theoretical and does not specify any software dependencies with version numbers for experimental replication.
Experiment Setup No The paper is theoretical and does not include details on experimental setup such as hyperparameters or training configurations.