Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Monoculture or Multiplicity: Which Is It?

Authors: Mila Gorecki, Moritz Hardt

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct a comprehensive empirical evaluation to test both claims. We work from the premise that increasingly decision makers will use large language models for consequential prediction tasks. We therefore examine 50 language models, open source models ranging in size from 1B to 141B parameters and state-of-the-art commercial models, under 4 different prompt variations, and across 6 different prediction tasks.
Researcher Affiliation | Academia | Mila Gorecki, Moritz Hardt; Max Planck Institute for Intelligent Systems, Tübingen; Tübingen AI Center
Pseudocode | No | The paper describes methodologies and definitions but does not present any explicitly labeled pseudocode or algorithm blocks. The methods are described in narrative form.
Open Source Code | Yes | Code. We provide the code necessary to reproduce our analysis, along with a step-by-step guide for obtaining model predictions, available here: https://github.com/socialfoundations/mono-multi.
Open Datasets | Yes | Prediction tasks. We evaluate model predictions on seven binary classification tasks derived from three data sources. Five tasks are based on the American Community Survey (ACS) Public Use Microdata Sample (PUMS), a high-quality dataset from the U.S. Census Bureau [Flood et al., 2018]. ... The data comes from the Behavioral Risk Factor Surveillance System [BRFSS, Centers for Disease Control and Prevention (CDC), 2021]. ... SIPP is defined on the longitudinal Survey of Income and Program Participation [SIPP, U.S. Census Bureau, 2014].
Dataset Splits | Yes | We follow the default configuration provided by folktexts, adopting a random 80/10/10 split for training, validation, and test sets. All evaluations are performed exclusively on the test set; no models are trained in this study. For few-shot prompting experiments, we randomly sample 10 examples from the training set to construct the prompt context.
Hardware Specification | Yes | Resources used. We use an internal compute cluster with NVIDIA A100 and H100 GPUs. Zero-shot evaluation of all models required approximately 1000 GPU hours, while 10-shot prompting added an additional 2,500 GPU hours.
Software Dependencies | No | The paper mentions using the 'folktexts package [Cruz et al., 2024]' and 'scikit-learn' for XGBoost implementation, but specific version numbers for these software dependencies are not provided in the text.
Experiment Setup | Yes | For each model and task, we fit a decision threshold t on n = 2000 samples from a validation set to maximize balanced accuracy. The threshold is then applied to turn the risk scores into class predictions. ... We use the implementation provided by scikit-learn, with default hyperparameters. No additional hyperparameter tuning was performed.
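
The Dataset Splits entry describes a random 80/10/10 split and a draw of 10 training examples for few-shot prompts. The sketch below illustrates that procedure on row indices using only the Python standard library. It does not use the folktexts API; the function names, the seed handling, and the integer rounding of split sizes are assumptions made for illustration, and the defaults in folktexts may differ.

```python
import random


def split_80_10_10(n_rows, seed=0):
    """Shuffle row indices and cut them into train/validation/test parts.

    Hypothetical helper: sizes are computed with integer arithmetic, so any
    remainder rows end up in the test part.
    """
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_train = n_rows * 8 // 10
    n_val = n_rows // 10
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test


def sample_few_shot(train_idx, k=10, seed=0):
    """Draw k distinct training rows to serve as in-context examples."""
    rng = random.Random(seed)
    return rng.sample(train_idx, k)


if __name__ == "__main__":
    train, val, test = split_80_10_10(1000)
    shots = sample_few_shot(train, k=10)
    print(len(train), len(val), len(test), len(shots))
```

Drawing the few-shot examples from the training part only keeps the validation and test rows out of the prompt context, which matches the statement that evaluation is performed exclusively on the test set.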
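
The Experiment Setup entry describes fitting a decision threshold t on validation samples to maximize balanced accuracy, then applying it to risk scores. A minimal sketch of that step follows, in plain Python. The source does not say how the maximization is carried out, so the grid of 101 candidate thresholds, the function names, and the convention of predicting the positive class when the score is at least t are all assumptions; the authors' code in the linked repository may work differently.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate (binary labels)."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    tpr = tp / n_pos if n_pos else 0.0
    tnr = tn / n_neg if n_neg else 0.0
    return 0.5 * (tpr + tnr)


def apply_threshold(scores, t):
    """Turn risk scores into class predictions: 1 when score >= t, else 0."""
    return [1 if s >= t else 0 for s in scores]


def fit_threshold(val_scores, val_labels, n_grid=101):
    """Grid-search t in [0, 1] for the best balanced accuracy on validation data.

    Assumption: scores lie in [0, 1]; ties are broken in favor of the smallest t.
    """
    best_t, best_ba = 0.0, -1.0
    for i in range(n_grid):
        t = i / (n_grid - 1)
        ba = balanced_accuracy(val_labels, apply_threshold(val_scores, t))
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t, best_ba


if __name__ == "__main__":
    # Synthetic validation set of n = 2000 samples (illustrative data only).
    rng = random.Random(0)
    labels = [1 if rng.random() < 0.3 else 0 for _ in range(2000)]
    scores = [min(1.0, max(0.0, 0.35 * y + rng.gauss(0.3, 0.15))) for y in labels]
    t, ba = fit_threshold(scores, labels)
    print(f"threshold={t:.2f}, validation balanced accuracy={ba:.3f}")
```

Balanced accuracy weights both classes equally, so the fitted threshold can sit far from 0.5 when the risk scores are poorly calibrated or the classes are imbalanced.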