Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning to Defer with Limited Expert Predictions
Authors: Patrick Hemmer, Lukas Thede, Michael Vössing, Johannes Jakubik, Niklas Kühl
AAAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two public datasets. One with synthetically generated human experts and another from the medical domain containing real-world radiologists predictions. Our experiments show that the approach allows the training of various learning to defer algorithms with a minimal number of human expert predictions. |
| Researcher Affiliation | Academia | Karlsruhe Institute of Technology |
| Pseudocode | Yes | We formalize the approach in Algorithm A1 in the Appendix. |
| Open Source Code | Yes | Further implementation details and results are presented in the Appendix, which we provide together with the code at https://github.com/ptrckhmmr/learning-to-defer-with-limited-expert-predictions. |
| Open Datasets | Yes | We empirically demonstrate the efficiency of our approach on the CIFAR-100 dataset (Krizhevsky 2009) using synthetically generated human expert predictions and on the NIH chest X-ray dataset (Majkowska et al. 2020; Wang et al. 2017) that provides real-world individual radiologists predictions. |
| Dataset Splits | Yes | We allocate 40,000 images to the training and 10,000 images to the validation split while reserving 10,000 images for the test split. |
| Hardware Specification | No | The paper does not specify the particular hardware components (e.g., GPU model, CPU model, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions models like EfficientNet-B1 and ResNet18 and optimizers like SGD, but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We train the embedding model for 200 epochs using SGD as an optimizer with Nesterov momentum and a learning rate of 0.1. Each expertise predictor model of our Embedding SSL approaches is trained for 50 epochs using SGD with a learning rate of 0.03. |
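
The split sizes quoted in the Dataset Splits row (40,000 training, 10,000 validation, 10,000 test) can be illustrated with a small index-partitioning sketch. This is not the authors' code: the function name `split_indices`, the fixed seed, and the choice to shuffle all 60,000 indices together are assumptions made for illustration only. The quoted text does not say how the images were assigned to each split, for example whether the test split is a predefined one.

```python
import random


def split_indices(n_total, n_train, n_val, n_test, seed=0):
    """Partition range(n_total) into disjoint train/val/test index lists.

    Hypothetical helper: the seed and the shuffle-then-slice strategy
    are assumptions, not details taken from the paper.
    """
    if n_train + n_val + n_test != n_total:
        raise ValueError("split sizes must sum to n_total")
    indices = list(range(n_total))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Sizes as quoted in the Dataset Splits row.
train_idx, val_idx, test_idx = split_indices(60_000, 40_000, 10_000, 10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 40000 10000 10000
```

Fixing the seed makes the partition repeatable across runs, which is the property a reproducibility check on dataset splits is ultimately concerned with.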