Inferring Lexicographically-Ordered Rewards from Preferences

Authors: Alihan Hüyük, William R. Zame, Mihaela van der Schaar (pp. 5737–5745)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We offer two example applications in healthcare—one inspired by cancer treatment, the other inspired by organ transplantation—to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker’s preferences and help improve policies when used in reinforcement learning. For each experiment, we take ϵ = 0.5 and generate 1000 trajectories with τ = 20 to form the demonstration set D. Then, we generate preferences by sampling 1000 trajectory pairs from D and evaluating according to the ground-truth reward functions {r1, r2} and the model given in (5), where ε1 = ε2 = 0.1 and α1 = α2 = 10 log(9) (ties are broken uniformly at random). These form the set of expert preferences P. We infer k = 2 reward functions from the expert preferences P using LORI; for comparison, we infer a single reward function using T-REX (Brown et al. 2019), which is the single-dimensional counterpart of LORI (with k = 1), and another single reward function from demonstrations D instead of preferences P using Bayesian IRL (Ramachandran and Amir 2007). (A hedged sketch of this preference-generation step follows the table.)
Researcher Affiliation | Academia | (1) University of Cambridge, (2) University of California, Los Angeles, (3) The Alan Turing Institute
Pseudocode | No | The paper describes the model and inference process using mathematical formulations and textual descriptions but does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not contain an explicit statement about the release of open-source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | Our analysis is based on the Organ Procurement and Transplantation Network (OPTN) data for liver transplantations as of December 4, 2020. We estimate both benefit and need via Cox models following the same methodology as Transplant Benefit, which is used in the current allocation policy of the UK (Neuberger et al. 2008). (An illustrative Cox-model sketch follows the table.)
Dataset Splits | No | The paper mentions training models and evaluating on a 'test set', but it does not explicitly provide details about a validation dataset split (e.g., percentages, sample counts, or specific methodology for creating a validation set).
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud computing specifications).
Software Dependencies | No | The paper refers to various algorithms and models (e.g., T-REX, Bayesian IRL, Cox models) but does not list specific software dependencies with version numbers.
Experiment Setup | Yes | For each experiment, we take ϵ = 0.5 and generate 1000 trajectories with τ = 20 to form the demonstration set D. Then, we generate preferences by sampling 1000 trajectory pairs from D and evaluating according to the ground-truth reward functions {r1, r2} and the model given in (5), where ε1 = ε2 = 0.1 and α1 = α2 = 10 log(9) (ties are broken uniformly at random). (We set α1 = α2 = 1; there is no loss of generality because the other variables simply scale.)
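
The preference-generation step quoted in the Research Type and Experiment Setup rows can be made concrete with a small sketch. The paper's model (5) is not reproduced on this page, so the snippet below assumes a logistic (Bradley-Terry-style) comparison applied lexicographically: the first reward level decides the preference unless the two returns differ by less than the indifference threshold ε1, in which case the decision falls to the next level, and ties at every level are broken uniformly at random. Under that assumption, α1 = α2 = 10 log(9) maps a return gap of 0.1 at the deciding level to a preference probability of 1/(1 + exp(-log 9)) = 0.9. All names below (returns, sample_preference, the toy reward functions) are illustrative and not taken from the paper or its code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters quoted in the table; the preference model itself is only a
# guess at the shape of the paper's model (5), not its exact form.
EPS = [0.1, 0.1]                  # indifference thresholds eps_1, eps_2
ALPHA = [10 * np.log(9)] * 2      # temperatures alpha_1, alpha_2
K = 2                             # number of lexicographically-ordered rewards


def returns(traj, reward_fns):
    """Total return of a trajectory under each reward function r_1, ..., r_k."""
    return np.array([sum(r(s, a) for s, a in traj) for r in reward_fns])


def sample_preference(traj_a, traj_b, reward_fns):
    """Sample which trajectory is preferred: 0 for traj_a, 1 for traj_b.

    Assumed model: walk down the reward levels; at the first level whose
    return gap exceeds eps_i, prefer probabilistically via a logistic in
    alpha_i * gap. If every level lies within its indifference band,
    break the tie uniformly at random.
    """
    g_a, g_b = returns(traj_a, reward_fns), returns(traj_b, reward_fns)
    for i in range(K):
        gap = g_a[i] - g_b[i]
        if abs(gap) > EPS[i]:
            p_a = 1.0 / (1.0 + np.exp(-ALPHA[i] * gap))
            return 0 if rng.random() < p_a else 1
    return int(rng.random() < 0.5)  # tie at every level


# Toy usage with made-up reward functions over (state, action) pairs:
r1 = lambda s, a: float(a)          # primary objective rewards acting
r2 = lambda s, a: -0.1 * float(a)   # secondary objective penalises acting
traj_a = [(0, 1), (1, 1)]
traj_b = [(0, 0), (1, 0)]
print(sample_preference(traj_a, traj_b, [r1, r2]))  # prefers traj_a w.h.p.
```

Repeating sample_preference over 1000 trajectory pairs drawn from D would yield a set of preference labels analogous to the expert preferences P in the quoted setup.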
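
The Open Datasets row mentions estimating both benefit and need via Cox models, following the Transplant Benefit methodology (Neuberger et al. 2008). The paper's covariates and preprocessing are not listed on this page, so the sketch below only illustrates the general pattern with the lifelines library on a hypothetical OPTN-style extract; the file name and all column names are placeholders, not the fields the authors actually used.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical OPTN-style extract; the path and columns are placeholders.
df = pd.read_csv("optn_liver_extract.csv")

# "Need": hazard of dying on the waiting list without a transplant.
need_model = CoxPHFitter()
need_model.fit(
    df[["age", "meld_score", "albumin", "days_on_list", "died_waiting"]],
    duration_col="days_on_list",
    event_col="died_waiting",
)

# Post-transplant survival; "benefit" would contrast this model with the
# waiting-list (need) model, as in the Transplant Benefit score.
post_model = CoxPHFitter()
post_model.fit(
    df[["age", "meld_score", "donor_age", "days_post_tx", "died_post_tx"]],
    duration_col="days_post_tx",
    event_col="died_post_tx",
)

# Relative risk scores for each patient under the two fitted models.
need_score = need_model.predict_partial_hazard(df)
post_score = post_model.predict_partial_hazard(df)
```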