Preference learning along multiple criteria: A game-theoretic perspective
Authors: Kush Bhatia, Ashwin Pananjady, Peter Bartlett, Anca Dragan, Martin J. Wainwright
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences. Our experiment demonstrates that the Blackwell winner is able to better trade off utility along these criteria and produces randomized policies that outperform the von Neumann winner for the overall preferences. |
| Researcher Affiliation | Academia | Kush Bhatia (EECS, UC Berkeley, kush@cs.berkeley.edu); Ashwin Pananjady (Simons Institute, UC Berkeley, ashwinpm@berkeley.edu); Peter L. Bartlett (EECS and Statistics, UC Berkeley, peter@berkeley.edu); Anca D. Dragan (EECS, UC Berkeley, anca@berkeley.edu); Martin J. Wainwright (EECS and Statistics, UC Berkeley, wainwrig@berkeley.edu) |
| Pseudocode | No | The paper discusses specific algorithms in Appendix C, but does not provide pseudocode or an explicit algorithm block in the main text or the provided abstract. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | The paper describes a user study and states that 'The cumulative comparison data is given in Appendix D', but it does not provide concrete access information (link, DOI, repository, or formal citation for external access) for this dataset. |
| Dataset Splits | No | The paper describes data collection from user studies but does not provide specific training/test/validation dataset splits, as it's not a typical machine learning model training setup. |
| Hardware Specification | No | The paper does not provide specific hardware details (like exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | For our base policies, we design five different reward functions encoding different self-driving behaviors. These policies, named Policy A-E, are then set to be the model-predictive-control policies based on these reward functions, wherein we fix the planning horizon to 6. A randomized policy π ∈ Δ5 is given by a distribution over the base policies A-E. Such a randomized policy is implemented in our environment by randomly sampling a base policy from the mixture distribution after every H = 18 time steps and executing this selected policy for that duration. To account for the randomization, we execute each such policy for 5 independent runs in each of the worlds and record these behaviors. ... We begin with the two target sets S1 and S2 for our evaluation of the Blackwell winner, which were selected in a data-oblivious manner. Set S1 is an axis-aligned set promoting the use of safer policies, with the score vector constrained to have a larger value along the collision-risk axis. Similar to Figure 2(b), the set S2 adds a linear constraint along aggressiveness and collision risk. This target set thus favors policies which are less aggressive and have lower collision risk. For evaluating hypothesis MH2, we considered several weight vectors, both oblivious and data-dependent, comprising the average of the users' self-reported weights, the weights obtained by regressing the overall criterion on C1-C5, and a set of oblivious weights. |
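
The randomized-policy execution quoted in the Experiment Setup row (resample one of the five base policies from the mixture distribution every H = 18 steps, and repeat each rollout 5 times to average over the randomization) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the environment interface, the `act` method of the base policies, the episode length, and all function names are hypothetical stand-ins.

```python
import numpy as np

H = 18              # steps between resampling the base policy (from the paper)
N_RUNS = 5          # independent runs per world, to average over the randomization
EPISODE_LEN = 90    # assumed episode length; not stated in this excerpt

def rollout_randomized_policy(env, base_policies, mixture, episode_len=EPISODE_LEN, rng=None):
    """Execute a randomized policy given by `mixture` (a probability vector over the
    base policies A-E), resampling the active base policy every H time steps."""
    rng = rng or np.random.default_rng()
    obs = env.reset()                 # assumed environment interface
    trajectory = []
    active = None
    for t in range(episode_len):
        if t % H == 0:                # resample at the start of every H-step block
            active = base_policies[rng.choice(len(base_policies), p=mixture)]
        action = active.act(obs)      # e.g. an MPC policy with planning horizon 6
        obs = env.step(action)
        trajectory.append((obs, action))
    return trajectory

def evaluate(env, base_policies, mixture, n_runs=N_RUNS, seed=0):
    """Average behavior over independent runs to account for the randomization."""
    rng = np.random.default_rng(seed)
    return [rollout_randomized_policy(env, base_policies, mixture, rng=rng)
            for _ in range(n_runs)]
```

The selection of the mixture itself is not shown here: per the excerpt, the Blackwell winner is evaluated against the target sets S1 and S2 over the five-criterion score vectors, while the weighted (MH2) baselines use the self-reported, regressed, or oblivious weight vectors.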