Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning Personalized Decision Support Policies
Authors: Umang Bhatt, Valerie Chen, Katherine M. Collins, Parameswaran Kamalaruban, Emma Kallina, Adrian Weller, Ameet Talwalkar
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our computational experiments explore the utility of personalization across multiple expertise profiles. ... To validate Modiste on real users (N = 80), we conduct human subject experiments, where we explore forms of support that include expert consensus, outputs from an LLM, or predictions from a classification model. ... we demonstrate how Modiste can be used to learn personalized decision support policies online on both vision and language tasks. |
| Researcher Affiliation | Academia | (1) New York University; (2) The Alan Turing Institute; (3) Carnegie Mellon University; (4) University of Cambridge |
| Pseudocode | Yes | Algorithm 1: Learning a decision support policy 1: Input: human decision-maker $h$ 2: Initialization: data buffer $D_0 = \{\}$; human error values $\{\hat{r}_{A_i,0}(x; h) = 0.5 : x \in \mathcal{X}, A_i \in \mathcal{A}\}$; initial policy $\pi_1$ 3: for $t = 1, 2, \ldots, T$ do 4: data point $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$ is drawn iid from $P$ 5: support $a_t \in \mathcal{A}$ is selected using policy $\pi_t$ 6: human makes the prediction $\tilde{y}_t$ based on $x_t$ and $a_t$ 7: human incurs the loss $\ell(y_t, \tilde{y}_t)$ 8: update the buffer $D_t \leftarrow D_{t-1} \cup \{(x_t, a_t, \ell(y_t, \tilde{y}_t))\}$ 9: update the decision support policy: $\hat{r}_{A_i,t}(x; h) \leftarrow U_r(\hat{r}_{A_i,t-1}(x; h), D_t)$, $\forall A_i \in \mathcal{A}$ (Step 1); $\pi_{t+1}(x) \leftarrow U_\pi(\{\hat{r}_{A_i,t}\}_i)$ (Step 2) 10: end for 11: Output: policy $\pi^{\text{alg}}_\lambda \leftarrow \pi_{T+1}$ |
| Open Source Code | Yes | We open-source Modiste as a tool to encourage the adoption of personalized decision support policies. |
| Open Datasets | Yes | 1. CIFAR-10 (Krizhevsky 2009), a 10-class image classification dataset; 2. MMLU (Hendrycks et al. 2020), a multi-task text-based benchmark that tests for knowledge and problem-solving ability across 57 topics in both the humanities and STEM. |
| Dataset Splits | No | The paper describes how they constructed tasks for CIFAR-3A and MMLU-2A, mentioning aspects like the number of images/questions for human interaction (100 for CIFAR-3A, 60 for MMLU-2A) or how classes were corrupted, but it does not specify any training, validation, or test dataset splits for machine learning models. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Instruct GPT3.5, text-davinci-003' for the LLM support, but it does not list any specific software libraries, frameworks, or operating systems with their version numbers that would be required to replicate the experiments. |
| Experiment Setup | Yes | Via pilot studies, we found that 100 CIFAR images or 60 MMLU questions were a reasonable number of decisions to make within 20-40 minutes (a typical time limit for an online study), which we use throughout our experiments. ... Algorithm 1: ... 2: Initialization: data buffer $D_0 = \{\}$; human error values $\{\hat{r}_{A_i,0}(x; h) = 0.5 : x \in \mathcal{X}, A_i \in \mathcal{A}\}$; initial policy $\pi_1$ |
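
The online loop quoted in the Pseudocode row can be illustrated with a minimal Python sketch. It follows the quoted steps (draw a point, pick a form of support, observe the human's loss, update the buffer and the error estimates), but the two update rules are assumptions made here for illustration and are not taken from the paper: the error update is a per-support running mean of observed losses that ignores the input `x`, and the policy update is epsilon-greedy over those estimates. The names `learn_policy`, `draw_point`, `human_predict`, and the simulated error rates are all hypothetical.

```python
import random
from collections import defaultdict


def learn_policy(supports, draw_point, human_predict, loss, T, epsilon=0.2, seed=0):
    """Sketch of the online loop; the update rules below are assumptions."""
    rng = random.Random(seed)
    buffer = []                                # data buffer, initially empty
    r_hat = {a: 0.5 for a in supports}         # estimated human error, initialised to 0.5
    counts = defaultdict(int)                  # observations seen per form of support

    def policy(x):
        # Assumed policy update: epsilon-greedy over estimated errors (ignores x).
        if rng.random() < epsilon:
            return rng.choice(supports)
        return min(supports, key=lambda a: r_hat[a])

    for _ in range(T):
        x, y = draw_point(rng)                 # data point drawn iid
        a = policy(x)                          # support selected by the current policy
        y_tilde = human_predict(x, a, rng)     # human predicts given x and the support
        l = loss(y, y_tilde)                   # human incurs the loss
        buffer.append((x, a, l))               # update the buffer
        counts[a] += 1
        # Assumed error update: running mean of the losses seen for this support.
        r_hat[a] += (l - r_hat[a]) / counts[a]

    # `policy` reads r_hat by reference, so it reflects the final estimates.
    return policy, r_hat, buffer


# Hypothetical simulated decision-maker: errs 40% of the time unaided, 10% with model support.
ERROR_RATE = {"none": 0.4, "model": 0.1}


def draw_point(rng):
    return rng.random(), rng.randint(0, 1)     # (feature, binary label)


def human_predict(x, a, rng, _y=[None]):
    # The simulator needs the true label; learn_policy does not pass it, so
    # draw a fresh Bernoulli "mistake" and encode it as a prediction of -1.
    return -1 if rng.random() < ERROR_RATE[a] else None


def zero_one_loss(y, y_tilde):
    return 1.0 if y_tilde == -1 else 0.0       # -1 marks a simulated mistake


policy, r_hat, buffer = learn_policy(
    ["none", "model"], draw_point, human_predict, zero_one_loss, T=2000
)
```

In this simulation the estimate for `"model"` should settle near 0.1 and the estimate for `"none"` near 0.4, so the greedy branch of the returned policy favours model support. Swapping in a context-dependent estimator would only require changing the two lines marked as assumed updates.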