Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Generative Social Choice: The Next Generation

Authors: Niclas Boehmer, Sara Fish, Ariel D. Procaccia

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present the Proportional Slate Engine (PROSE) and evaluate it in experiments. [...] We evaluate PROSE on four instances drawn from drug reviews and a deliberation hosted on Polis. [...] In each case, PROSE outperforms four baseline approaches with respect to both user satisfaction and proportionality. We present a quantitative evaluation of the generated slates in Table 1.
Researcher Affiliation	Academia	(1) Hasso Plattner Institute, Germany; (2) Harvard University, USA. Correspondence to: Niclas Boehmer <EMAIL>, Sara Fish <EMAIL>.
Pseudocode	Yes	Algorithm 1 Democratic Process C,f(N, B, r)
Open Source Code	Yes	The code for PROSE and our other experiments is available at github.com/sara-fish/gen-soc-choice-next-gen.
Open Datasets	Yes	First, the publicly available UCI ML Drug Review dataset (Gräßer et al., 2018) [...] Second, the Bowling Green dataset is drawn from a public deliberation hosted on Polis (2023).
Dataset Splits	No	From this dataset, we create three subsampled instances (each with 80 agents): Birth Control (Balanced), which contains reviews of a birth control medication with all ratings appearing equally often; Birth Control (Imbalanced), which includes only birth control reviews with extreme and central ratings, i.e., (1,2,5,9,10); and Obesity, which contains reviews of an obesity medication with all ratings appearing in equal frequency.
Hardware Specification	Yes	with runtimes of 31–65 minutes on a single Intel i7-8565U CPU @ 1.80GHz.
Software Dependencies	Yes	PROSE leverages GPT-4o when answering discriminative or generative queries. [...] We embed each agent using their description via OpenAI's embedding-3-large.
Experiment Setup	Yes	In particular, for the three drug review instances, we use C = [80, 70, 60, 50, 40, 36, 32, 28, 24, 20, 16, 12, 10, 8, 6, 4, 2], while for bowlinggreen, which has a different word budget per agent, we use C = [80, 60, 40, 36, 32, 28, 24, 20, 16, 12, 8, 4]. Approval Levels: We use ℓ = [5.5, 5, 4.5, 4, 3.5, 3, 2, 1, 0] for each of the instances.
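
The values quoted in the Experiment Setup row can be recorded in a small configuration for anyone attempting a rerun. The following is a minimal sketch only: the instance keys and the names `C_DRUG`, `C_BOWLING_GREEN`, `APPROVAL_LEVELS`, and `schedule_for` are hypothetical and are not taken from the paper's repository; only the numeric values come from the quoted text.

```python
# Hypothetical config holding the values quoted in the Experiment Setup row.
# Names and keys are illustrative assumptions, not the authors' code.

# C for the three drug review instances (as quoted).
C_DRUG = [80, 70, 60, 50, 40, 36, 32, 28, 24, 20, 16, 12, 10, 8, 6, 4, 2]

# C for the Bowling Green instance (as quoted).
C_BOWLING_GREEN = [80, 60, 40, 36, 32, 28, 24, 20, 16, 12, 8, 4]

# Approval levels, shared by all instances (as quoted).
APPROVAL_LEVELS = [5.5, 5, 4.5, 4, 3.5, 3, 2, 1, 0]

# Assumed instance keys for the four instances named in the table.
DRUG_INSTANCES = ("birth_control_balanced", "birth_control_imbalanced", "obesity")


def schedule_for(instance: str) -> list:
    """Return the quoted C list for a (hypothetically named) instance."""
    if instance in DRUG_INSTANCES:
        return C_DRUG
    if instance == "bowlinggreen":
        return C_BOWLING_GREEN
    raise ValueError(f"unknown instance: {instance}")


def is_strictly_decreasing(values) -> bool:
    """Sanity check against transcription errors in the quoted lists."""
    return all(a > b for a, b in zip(values, values[1:]))
```

A check such as `is_strictly_decreasing(schedule_for("obesity"))` guards against copying errors when transcribing the lists from the paper.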