Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Online Decision Mediation

Authors: Daniel Jarrett, Alihan Hüyük, Mihaela van der Schaar

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, through experiments and sensitivities on a variety of real-world datasets, we illustrate consistent gains over applicable benchmarks on a comprehensive set of performance measures with respect to the mediator policy, the learned model, and the entire decision-making system as a unit (Section 4).
Researcher Affiliation	Collaboration	Daniel Jarrett¹, Alihan Hüyük¹, Mihaela van der Schaar¹,²,³; Department of Applied Mathematics and Theoretical Physics; ¹University of Cambridge, ²UCLA, ³Alan Turing Institute
Pseudocode	Yes	Algorithm 1 summarizes UMPIRE as applied to ODM.
Open Source Code	No	The code will be made available upon acceptance.
Open Datasets	Yes	Datasets. We experiment with six environments. In Gauss Sine, synthetic points are generated in three categories by rounding a sinusoidal latent function on 2D Gaussian input [61]. In High Energy, the task is to identify signals in high energy particles registered in a Cherenkov gamma telescope [62]. In Motion Capture, the task is to recognize hand postures from data recorded by glove markers on users [63]. In Lunar Lander, the task is to perform actions in the OpenAI Gym [64] Atari environment, with the expert defined as a PPO2 agent [65,66] trained on the true reward. In Alzheimers, the task is to perform early diagnosis of patients in the Alzheimer's Disease Neuroimaging Initiative study [67] as cognitively normal, mildly impaired, or at risk of dementia [19,20]. Lastly, in Cystic Fibrosis, the task is to perform diagnosis of patients enrolled in the UK Cystic Fibrosis registry [68] as to their GOLD grading in chronic obstructive pulmonary disease [69]. See Appendix B for additional detail.
Dataset Splits	No	Importantly, note that this is a more challenging objective than simply minimizing the generalization error of the model, system, or some asymptotic complexity thereof: Here we have no separation between "training" versus "testing", since losses begin accumulating from the very first step of the sequential process.
Hardware Specification	Yes	This work is not computation-heavy, so the details are not pertinent. (All work was done on a MacBook Pro, 13-inch, 2017).
Software Dependencies	No	The paper states that the underlying model policy is implemented using 'Dirichlet-based Gaussian process classifiers [61,70–72]' but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, scikit-learn, or other libraries).
Experiment Setup	Yes	Each experiment run consists of n=2000 rounds of interactions (except for the synthetic Gauss Sine, for which n=500), and this is repeated for a total of 10 runs with random seeds. ... In all experiments, we set kint =0.1, = 1 2, =10% where applicable, and = 0 as noted in Section 3.2.
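
The protocol quoted in the Experiment Setup and Dataset Splits rows (10 seeded runs, n rounds each, losses accumulating from the first round with no train/test split) can be sketched roughly as follows. This is an illustrative skeleton only, not the authors' code, which was not released: the names `run_once` and `run_experiment`, the use of seeds 0 to 9, and the random placeholder actions are all assumptions. Only the run count and round counts come from the table.

```python
import random

N_RUNS = 10  # "repeated for a total of 10 runs with random seeds"
# Rounds per run: 2000 by default, 500 for the synthetic Gauss Sine environment.
ROUNDS = {"Gauss Sine": 500, "default": 2000}


def run_once(seed, n_rounds, n_classes=3):
    """One online run; returns the cumulative mistake count after each round.

    Both actions are random placeholders (an assumption); a real run would
    query the decision system and the expert here.
    """
    rng = random.Random(seed)
    cumulative = 0
    curve = []
    for _ in range(n_rounds):
        expert_action = rng.randrange(n_classes)   # placeholder for the expert
        system_action = rng.randrange(n_classes)   # placeholder for the system
        # Loss counts from the very first round: no separate training phase.
        cumulative += int(system_action != expert_action)
        curve.append(cumulative)
    return curve


def run_experiment(env_name, n_runs=N_RUNS):
    """Repeat the online run once per seed (seeds 0..n_runs-1 are an assumption)."""
    n_rounds = ROUNDS.get(env_name, ROUNDS["default"])
    return [run_once(seed, n_rounds) for seed in range(n_runs)]


if __name__ == "__main__":
    curves = run_experiment("Gauss Sine")
    print(len(curves), len(curves[0]))
```

Fixing one seed per run makes each loss curve reproducible, which is the property the seed count in the table is meant to support.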