Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Preference-based Pure Exploration
Authors: Apurv Shukla, Debabrota Basu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study the preference-based pure exploration problem for bandits with vectorvalued rewards. The rewards are ordered using a (given) preference cone C and our goal is to identify the set of Pareto optimal arms. First, to quantify the impact of preferences, we derive a novel lower bound on sample complexity for identifying the most preferred policy with a confidence level 1 δ. Our lower bound elicits the role played by the geometry of the preference cone and punctuates the difference in hardness compared to existing best-arm identification variants of the problem. We further explicate this geometry when the rewards follow Gaussian distributions. We then provide a convex relaxation of the lower bound and leverage it to design the Preference-based Track and Stop (Pre TS) algorithm that identifies the most preferred policy. Finally, we show that the sample complexity of Pre TS is asymptotically tight by deriving a new concentration inequality for vector-valued rewards. ... As future work, it would be interesting to verify our results on a real-world datasets. |
| Researcher Affiliation | Academia | Apurv Shukla Department of ECE, Texas A&M University College Station, TX 77840 EMAIL Debabrota Basu Equipe School, Univ. Lille, Inria, CNRS Centrale Lille, UMR-9189 CRISt AL, France EMAIL |
| Pseudocode | Yes | Algorithm 1 Preference-based Track-and-Stop (Pre TS) 1: Input: Confidence parameter δ, preference cone C 2: if supπ t min M ch Λπ t ( ˆ Mt) minz C P k Nk,td KL z ˆ M (k) t , z M (k) β(t, δ) then 3: Compute wt arg maxw K VC(w, ˆ Mt) 4: Play kt arg mink [K] Nk,t Pt s=1 ws 5: Observe reward rt 6: Construct an empirical estimator ˆ Mt 7: end if 8: Construct a Pareto Front ˆPτ from empirical means ˆ Mτ 9: Return: ˆPτ |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the release of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not present experiments that use a specific dataset, therefore it does not mention public access information for any dataset. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments with dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not describe any hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers for experimental replication. |
| Experiment Setup | No | The paper is theoretical and does not include details on experimental setup such as hyperparameters or training configurations. |