Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Evaluating multiple models using labeled and unlabeled data

Authors: Divya Shanmugam, Shuvom Sadhuka, Manish Raghavan, John Guttag, Bonnie Berger, Emma Pierson

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experiments in four domains where obtaining large labeled datasets is often impractical: healthcare, content moderation, molecular property prediction, and text classification. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1× relative to using labeled data alone and 2.4× relative to the next best method.
Researcher Affiliation | Academia | Divya Shanmugam1* Shuvom Sadhuka2* Manish Raghavan2,3 John Guttag2 Bonnie Berger2** Emma Pierson4** 1Cornell University 2Massachusetts Institute of Technology 4University of California, Berkeley
Pseudocode | Yes | Algorithm 1 details the EM procedure we use to fit our mixture model. ... see Algorithm 2 for corresponding pseudocode.
Open Source Code | Yes | Our code is available at https://github.com/divyashan/SSME. Most of the data we use is open-access and does not require credentials.
Open Datasets | Yes | We evaluate SSME on five classification datasets: (1) MIMIC-IV [38]... (2) Civil Comments [9]... (3) OGB-SARS-CoV [33]... (4) MultiNLI [73]... and (5) AG News [75]...
Dataset Splits | Yes | We partition each dataset into three splits: the classifier training split (used to train the classifiers whose performance SSME estimates), the estimation split (used to fit SSME and estimate classifier performance), and the evaluation split (used to produce a held-out, ground-truth measure of classifier performance)... The estimation split consists of either 20, 50, or 100 labeled examples and 1000 unlabeled examples across all experiments. The size of the evaluation split is on the order of thousands of labeled examples and varies by task (see Appendix B.1 for exact split sizes).
Hardware Specification | Yes | All experiments can be done on a standard laptop without GPU access. ... Our normalizing flow is lightweight and trains in less than a minute for each dataset in our experiments section using 1 80GB NVIDIA A100 GPU.
Software Dependencies | No | We train a logistic regression with the default parameters associated with the scikit-learn implementation [54].
Experiment Setup | Yes | We optimize the parameters using EM over 1000 epochs. ... The estimation split consists of either 20, 50, or 100 labeled examples and 1000 unlabeled examples across all experiments. ... We implement Dawid-Skene with a tolerance of 1e-5 and a maximum number of EM iterations of 100 (the default parameters)...
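
The three-way partition quoted under Dataset Splits can be illustrated with a minimal sketch. This is not the authors' code: the function name make_splits, the dataset size, the training-split size, and the use of a uniform random shuffle are all assumptions made for illustration. Only the estimation-split sizes (20, 50, or 100 labeled examples plus 1000 unlabeled examples) come from the quoted text.

```python
import random


def make_splits(n_examples, n_train, n_labeled, n_unlabeled=1000, seed=0):
    """Partition example indices into the three splits quoted above.

    Hypothetical helper: a uniform random shuffle is assumed here; the
    source does not say how examples are assigned to splits.
    """
    n_estimation = n_labeled + n_unlabeled
    if n_train + n_estimation >= n_examples:
        raise ValueError("not enough examples left for the evaluation split")

    order = list(range(n_examples))
    random.Random(seed).shuffle(order)

    labeled_end = n_train + n_labeled
    estimation_end = n_train + n_estimation
    return {
        # Used to train the classifiers whose performance is estimated.
        "classifier_training": order[:n_train],
        # Estimation split, labeled part (20, 50, or 100 examples).
        "estimation_labeled": order[n_train:labeled_end],
        # Estimation split, unlabeled part (labels are withheld).
        "estimation_unlabeled": order[labeled_end:estimation_end],
        # Held-out labeled examples for the ground-truth measurement.
        "evaluation": order[estimation_end:],
    }


# Illustrative sizes only; Appendix B.1 of the paper gives the real ones.
splits = make_splits(n_examples=10_000, n_train=5_000, n_labeled=20)
```

The splits are disjoint by construction, so the evaluation split never overlaps with the examples used to fit the estimator.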