Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Monoculture or Multiplicity: Which Is It?

Authors: Mila Gorecki, Moritz Hardt

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we conduct a comprehensive empirical evaluation to test both claims. We work from the premise that increasingly decision makers will use large language models for consequential prediction tasks. We therefore examine 50 language models, open source models ranging in size from 1B to 141B parameters and state-of-the-art commercial models, under 4 different prompt variations, and across 6 different prediction tasks.
Researcher Affiliation	Academia	Mila Gorecki Moritz Hardt Max Planck Institute for Intelligent Systems, Tübingen Tübingen AI Center EMAIL
Pseudocode	No	The paper describes methodologies and definitions but does not present any explicitly labeled pseudocode or algorithm blocks. The methods are described in narrative form.
Open Source Code	Yes	Code. We provide the code necessary to reproduce our analysis, along with a step-by-step guide for obtaining model predictions, available here: https://github.com/socialfoundations/mono-multi.
Open Datasets	Yes	Prediction tasks. We evaluate model predictions on seven binary classification tasks derived from three data sources. Five tasks are based on the American Community Survey (ACS) Public Use Microdata Sample (PUMS), a high-quality dataset from the U.S. Census Bureau [Flood et al., 2018]. ... The data comes from the Behavioral Risk Factor Surveillance System [BRFSS, Centers for Disease Control and Prevention (CDC), 2021]. ... SIPP is defined on the longitudinal Survey of Income and Program Participation [SIPP, U.S. Census Bureau, 2014].
Dataset Splits	Yes	We follow the default configuration provided by folktexts, adopting a random 80/10/10 split for training, validation, and test sets. All evaluations are performed exclusively on the test set; no models are trained in this study. For few-shot prompting experiments, we randomly sample 10 examples from the training set to construct the prompt context.
Hardware Specification	Yes	Resources used. We use an internal compute cluster with NVIDIA A100 and H100 GPUs. Zero-shot evaluation of all models required approximately 1000 GPU hours, while 10-shot prompting added an additional 2,500 GPU hours.
Software Dependencies	No	The paper mentions using the 'folktexts package [Cruz et al., 2024]' and 'scikit-learn' for XGBoost implementation, but specific version numbers for these software dependencies are not provided in the text.
Experiment Setup	Yes	For each model and task, we fit a decision threshold t on n = 2000 samples from a validation set to maximize balanced accuracy. The threshold is then applied to turn the risk scores into class predictions. ... We use the implementation provided by scikit-learn, with default hyperparameters. No additional hyperparameter tuning was performed.
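
The threshold-fitting step quoted under Experiment Setup (choose a decision threshold t on validation samples to maximize balanced accuracy, then apply it to the risk scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names balanced_accuracy and fit_threshold are hypothetical, the decision rule "predict 1 when score >= t" and the choice of candidate thresholds (every observed score) are assumptions, and the validation data below is synthetic.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for binary labels."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return 0.5 * (tpr + tnr)


def fit_threshold(scores, labels):
    """Return (t, balanced accuracy) for the best rule 'score >= t'.

    Assumption: candidate thresholds are the distinct observed scores;
    ties keep the smallest such threshold.
    """
    best_t, best_ba = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        ba = balanced_accuracy(labels, preds)
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t, best_ba


# Synthetic stand-in for n = 2000 validation samples (hypothetical data).
random.seed(0)
val_labels = [1 if random.random() < 0.3 else 0 for _ in range(2000)]
val_scores = [
    min(1.0, max(0.0, random.gauss(0.3 + 0.3 * y, 0.15))) for y in val_labels
]

t, val_ba = fit_threshold(val_scores, val_labels)

# Apply the fitted threshold to new risk scores to obtain class predictions.
new_scores = [0.12, 0.47, 0.81]
new_preds = [1 if s >= t else 0 for s in new_scores]
print(f"fitted threshold: {t:.3f}, validation balanced accuracy: {val_ba:.3f}")
print("predictions:", new_preds)
```

The exhaustive search is quadratic in the number of validation samples, which is acceptable at n = 2000; a sort-based sweep would be preferable for larger validation sets.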