Contextual Bandits and Imitation Learning with Preference-Based Active Queries

Authors: Ayush Sekhari, Karthik Sridharan, Wen Sun, Runzhe Wu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as O(min{√T, d/Δ}), where T is the number of interactions, d is the eluder dimension of the function class, and Δ is the minimum preference of the optimal action over any suboptimal action under all contexts. Our algorithm does not require knowledge of Δ, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandit setting where the learner observes reward signals at each round. Additionally, our algorithm makes only O(min{T, d²/Δ²}) queries to the expert. We then extend our algorithm to the imitation learning setting, where the agent engages with an unknown environment in episodes of length H, and provide similar guarantees on regret and query complexity. Interestingly, with preference-based feedback, our imitation learning algorithm can learn a policy that outperforms a suboptimal expert, matching the result from interactive imitation learning algorithms [Ross and Bagnell, 2014] that require access to the expert's actions and also reward signals. (The two contextual bandit bounds are restated in display math after this table.)
Researcher Affiliation | Academia | Ayush Sekhari (MIT, sekhari@mit.edu); Karthik Sridharan (Cornell University, ks999@cornell.edu); Wen Sun (Cornell University, ws455@cornell.edu); Runzhe Wu (Cornell University, rw646@cornell.edu)
Pseudocode | Yes | Algorithm 1: Active preference qUeRy fOR contextual bAndits (AURORA) and Algorithm 2: Active preference qUeRy fOR imitAtion lEarning (AURORAE)
Open Source Code | No | The paper does not provide any statements about releasing open-source code or links to a code repository.
Open Datasets | No | This is a theoretical paper that focuses on algorithm design and proofs; therefore, it does not use or refer to specific public datasets for training.
Dataset Splits | No | The paper is theoretical and does not involve empirical experiments with datasets, so it does not discuss training, validation, or test splits.
Hardware Specification | No | The paper focuses on theoretical algorithms and their mathematical guarantees, not on empirical experimentation. Therefore, no hardware specifications for running experiments are provided.
Software Dependencies | No | The paper is theoretical and does not describe an implementation of experiments, so it does not list software dependencies or version numbers.
Experiment Setup | No | The paper is theoretical, presenting algorithms and their theoretical guarantees rather than an experimental setup with hyperparameters or training details.
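For reference, the two contextual bandit guarantees quoted in the Research Type row can be written in display math as below. This is only a restatement of the bounds from the abstract, using the abstract's own notation; no new result is implied.

```latex
% Regret and expert-query bounds for the contextual bandit setting, as
% stated in the paper's abstract: T is the number of rounds, d is the
% eluder dimension of the function class, and \Delta is the minimum
% preference of the optimal action over any suboptimal action across
% all contexts.
\[
  \mathrm{Regret}(T) \;=\; O\!\left(\min\left\{\sqrt{T},\ \frac{d}{\Delta}\right\}\right),
  \qquad
  \#\text{expert queries} \;=\; O\!\left(\min\left\{T,\ \frac{d^{2}}{\Delta^{2}}\right\}\right).
\]
```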