Preference learning along multiple criteria: A game-theoretic perspective

Authors: Kush Bhatia, Ashwin Pananjady, Peter Bartlett, Anca Dragan, Martin J. Wainwright

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences. Our experiment demonstrates that the Blackwell winner is able to better trade off utility along these criteria and produces randomized policies that outperform the von Neumann winner for the overall preferences.
Researcher Affiliation | Academia | Kush Bhatia (EECS, UC Berkeley, kush@cs.berkeley.edu); Ashwin Pananjady (Simons Institute, UC Berkeley, ashwinpm@berkeley.edu); Peter L. Bartlett (EECS and Statistics, UC Berkeley, peter@berkeley.edu); Anca D. Dragan (EECS, UC Berkeley, anca@berkeley.edu); Martin J. Wainwright (EECS and Statistics, UC Berkeley, wainwrig@berkeley.edu)
Pseudocode | No | The paper discusses specific algorithms in Appendix C, but does not provide pseudocode or an explicit algorithm block in the main text or the provided abstract.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository.
Open Datasets | No | The paper describes a user study and states that 'The cumulative comparison data is given in Appendix D', but it does not provide concrete access information (link, DOI, repository, or formal citation for external access) for this dataset.
Dataset Splits | No | The paper describes data collection from user studies but does not provide specific training/validation/test dataset splits; the study is not a typical machine-learning model-training setup.
Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | For our base policies, we design five different reward functions encoding different self-driving behaviors. These policies, named Policy A-E, are then set to be the model-predictive-control policies based on these reward functions, with the planning horizon fixed to 6. A randomized policy π ∈ Δ5 is given by a distribution over the base policies A-E. Such a randomized policy is implemented in our environment by randomly sampling a base policy from the mixture distribution after every H = 18 time steps and executing the selected policy for that duration. To account for the randomization, we execute each such policy for 5 independent runs in each of the worlds and record these behaviors. ... We begin with the two target sets S1 and S2 for our evaluation of the Blackwell winner, which were selected in a data-oblivious manner. Set S1 is an axis-aligned set promoting the use of safer policies, with the score vector constrained to have a larger value along the collision-risk axis. Similar to Figure 2(b), the set S2 adds a linear constraint along aggressiveness and collision risk. This target set thus favors policies which are less aggressive and have lower collision risk. For evaluating hypothesis MH2, we considered several weight vectors, both oblivious and data-dependent: the average of the users' self-reported weights, the weights obtained by regressing the overall criterion on C1-C5, and a set of oblivious weights.
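The Experiment Setup row above describes how a randomized policy over the base policies A-E is executed: a base policy is resampled from the mixture distribution every H = 18 time steps, and each mixture is rolled out for 5 independent runs per world. The following is a minimal sketch of that execution loop, not the authors' code; the environment/policy interface (env.reset, env.step, policy.act) and the episode length are assumptions, since neither is specified in the excerpt.

```python
# Hypothetical sketch of the randomized-policy rollout described above.
# Assumptions (not from the paper): env.reset / env.step / policy.act interface,
# and an episode length of 90 steps.
import numpy as np

H = 18            # steps between resampling the base policy (from the paper)
NUM_RUNS = 5      # independent runs per world (from the paper)
EPISODE_LEN = 90  # assumed episode length

def run_randomized_policy(env, base_policies, mixture, rng, episode_len=EPISODE_LEN):
    """Roll out a mixture distribution over base policies A-E in a single world."""
    state = env.reset()                       # assumed environment interface
    trajectory = []
    for t in range(episode_len):
        if t % H == 0:                        # resample a base policy every H steps
            policy = base_policies[rng.choice(len(base_policies), p=mixture)]
        action = policy.act(state)            # assumed MPC-policy interface
        state, info = env.step(action)
        trajectory.append((state, action, info))
    return trajectory

def evaluate(env, base_policies, mixture, seed=0):
    """Repeat the rollout for NUM_RUNS independent runs to account for randomization."""
    rng = np.random.default_rng(seed)
    return [run_randomized_policy(env, base_policies, mixture, rng)
            for _ in range(NUM_RUNS)]
```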
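The Research Type row compares the paper's Blackwell winner against the von Neumann winner, i.e. the maximin mixed strategy of the pairwise-preference game. As a point of reference, here is a minimal sketch, not the authors' code, of computing a von Neumann winner from a pairwise preference matrix with an off-the-shelf LP solver; the function name and the use of scipy are assumptions.

```python
# Hypothetical sketch: von Neumann winner of a preference matrix P, where
# P[i, j] = Prob(policy i is preferred over policy j) and P[i, j] + P[j, i] = 1.
# The winner is the maximin mixed strategy, computed here via a linear program.
import numpy as np
from scipy.optimize import linprog

def von_neumann_winner(P):
    n = P.shape[0]
    # Variables: [pi_1, ..., pi_n, t]; objective: maximize t  <=>  minimize -t.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every pure opponent j:  sum_i pi_i * P[i, j] >= t,
    # rewritten as  -P[:, j]^T pi + t <= 0.
    A_ub = np.hstack([-P.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probability simplex: sum_i pi_i = 1, pi_i >= 0.
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]   # mixed strategy over the policies
```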