Inferring Lexicographically-Ordered Rewards from Preferences

Authors: Alihan Hüyük, William R. Zame, Mihaela van der Schaar (pp. 5737–5745)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We offer two example applications in healthcare—one inspired by cancer treatment, the other inspired by organ transplantation—to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker’s preferences and help improve policies when used in reinforcement learning. For each experiment, we take ϵ = 0.5 and generate 1000 trajectories with τ = 20 to form the demonstration set D. Then, we generate preferences by sampling 1000 trajectory pairs from D and evaluating according to the ground-truth reward functions {r1, r2} and the model given in (5), where ε1 = ε2 = 0.1 and α1 = α2 = 10 log(9) (ties are broken uniformly at random). These form the set of expert preferences P. We infer k = 2 reward functions from the expert preferences P using LORI; for comparison, we infer a single reward function using T-REX (Brown et al. 2019), which is the single-dimensional counterpart of LORI (with k = 1), and another single reward function from demonstrations D instead of preferences P using Bayesian IRL (Ramachandran and Amir 2007). (A hedged sketch of this preference-generation step follows the table.)
Researcher Affiliation | Academia | (1) University of Cambridge, (2) University of California, Los Angeles, (3) The Alan Turing Institute
Pseudocode | No | The paper describes the model and inference process using mathematical formulations and textual descriptions but does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not contain an explicit statement about the release of open-source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | Our analysis is based on the Organ Procurement and Transplantation Network (OPTN) data for liver transplantations as of December 4, 2020. We estimate both benefit and need via Cox models following the same methodology as Transplant Benefit, which is used in the current allocation policy of the UK (Neuberger et al. 2008). (An illustrative Cox-model sketch follows the table.)
Dataset Splits | No | The paper mentions training models and evaluating on a 'test set', but it does not explicitly provide details about a validation dataset split (e.g., percentages, sample counts, or specific methodology for creating a validation set).
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud computing specifications).
Software Dependencies | No | The paper refers to various algorithms and models (e.g., T-REX, Bayesian IRL, Cox models) but does not list specific software dependencies with version numbers.
Experiment Setup | Yes | For each experiment, we take ϵ = 0.5 and generate 1000 trajectories with τ = 20 to form the demonstration set D. Then, we generate preferences by sampling 1000 trajectory pairs from D and evaluating according to the ground-truth reward functions {r1, r2} and the model given in (5), where ε1 = ε2 = 0.1 and α1 = α2 = 10 log(9) (ties are broken uniformly at random). (We set α1 = α2 = 1; there is no loss of generality because the other variables simply scale.)
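
The preference-generation step quoted in the Research Type and Experiment Setup rows can be made concrete with a small sketch. The paper's model (5) is not reproduced on this page, so the snippet below assumes a logistic (Bradley-Terry-style) comparison applied lexicographically: the first reward level decides the preference unless the two returns differ by less than the indifference threshold ε1, in which case the decision falls to the next level, and ties at every level are broken uniformly at random. Under that assumption, α1 = α2 = 10 log(9) maps a return gap of 0.1 at the deciding level to a preference probability of 1/(1 + exp(-log 9)) = 0.9. All names below (returns, sample_preference, the toy reward functions) are illustrative and not taken from the paper or its code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters quoted in the table; the preference model itself is only a
# guess at the shape of the paper's model (5), not its exact form.
EPS = [0.1, 0.1]                  # indifference thresholds eps_1, eps_2
ALPHA = [10 * np.log(9)] * 2      # temperatures alpha_1, alpha_2
K = 2                             # number of lexicographically-ordered rewards


def returns(traj, reward_fns):
    """Total return of a trajectory under each reward function r_1, ..., r_k."""
    return np.array([sum(r(s, a) for s, a in traj) for r in reward_fns])


def sample_preference(traj_a, traj_b, reward_fns):
    """Sample which trajectory is preferred: 0 for traj_a, 1 for traj_b.

    Assumed model: walk down the reward levels; at the first level whose
    return gap exceeds eps_i, prefer probabilistically via a logistic in
    alpha_i * gap. If every level lies within its indifference band,
    break the tie uniformly at random.
    """
    g_a, g_b = returns(traj_a, reward_fns), returns(traj_b, reward_fns)
    for i in range(K):
        gap = g_a[i] - g_b[i]
        if abs(gap) > EPS[i]:
            p_a = 1.0 / (1.0 + np.exp(-ALPHA[i] * gap))
            return 0 if rng.random() < p_a else 1
    return int(rng.random() < 0.5)  # tie at every level


# Toy usage with made-up reward functions over (state, action) pairs:
r1 = lambda s, a: float(a)          # primary objective rewards acting
r2 = lambda s, a: -0.1 * float(a)   # secondary objective penalises acting
traj_a = [(0, 1), (1, 1)]
traj_b = [(0, 0), (1, 0)]
print(sample_preference(traj_a, traj_b, [r1, r2]))  # prefers traj_a w.h.p.
```

Repeating sample_preference over 1000 trajectory pairs drawn from D would yield a set of preference labels analogous to the expert preferences P in the quoted setup.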
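
The Open Datasets row mentions estimating both benefit and need via Cox models, following the Transplant Benefit methodology (Neuberger et al. 2008). The paper's covariates and preprocessing are not listed on this page, so the sketch below only illustrates the general pattern with the lifelines library on a hypothetical OPTN-style extract; the file name and all column names are placeholders, not the fields the authors actually used.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical OPTN-style extract; the path and columns are placeholders.
df = pd.read_csv("optn_liver_extract.csv")

# "Need": hazard of dying on the waiting list without a transplant.
need_model = CoxPHFitter()
need_model.fit(
    df[["age", "meld_score", "albumin", "days_on_list", "died_waiting"]],
    duration_col="days_on_list",
    event_col="died_waiting",
)

# Post-transplant survival; "benefit" would contrast this model with the
# waiting-list (need) model, as in the Transplant Benefit score.
post_model = CoxPHFitter()
post_model.fit(
    df[["age", "meld_score", "donor_age", "days_post_tx", "died_post_tx"]],
    duration_col="days_post_tx",
    event_col="died_post_tx",
)

# Relative risk scores for each patient under the two fitted models.
need_score = need_model.predict_partial_hazard(df)
post_score = post_model.predict_partial_hazard(df)
```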