Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Truthful Aggregation of LLMs with an Application to Online Advertising

Authors: Ermis Soumalias, Michael Curry, Sven Seuken

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Via experiments with publicly available LLMs, we show that MOSAIC leads to high advertiser value and platform revenue with low computational costs. In Section 6, we provide experimental results for the online advertising domain. We demonstrate that MOSAIC quickly converges to the optimal LLM with low computational cost, generating significant value for the advertisers and revenue for the auctioneer while also being useful to the user.
Researcher Affiliation	Academia	Ermis Soumalias (University of Zurich, ETH AI Center); Michael J. Curry (University of Illinois Chicago); Sven Seuken (University of Zurich, ETH AI Center)
Pseudocode	Yes	Algorithm 1: Allocation Rule for MOSAIC. Input: user prompt $x$, reference LLM $\pi_{\text{ref}}$, LLM used for candidate reply generation $\pi_{\text{gen}}$, advertiser reward functions $\{r_i\}_{i=1}^{n}$, number of candidate replies to generate $M$, reference LLM weight $\tau$. Output: reply $y$ drawn according to the optimal distribution as defined in Equation (1) for the aggregate reward function $r(x, y) = \sum_{i=1}^{n} r_i(x, y)$.
Open Source Code	Yes	Our generated data and code are included in the supplemental material.
Open Datasets	Yes	We create synthetic instances, each comprising a user query (e.g., "How to learn a musical instrument online?") and two advertisers (e.g., "Music Mastery, offering online music lessons"). This matches the setup of Dütting et al. [2024] while highlighting MOSAIC's performance and revenue, even in low competition scenarios. We use Llama-2-7b-chat-hf [Touvron et al., 2023] as the base architecture for all LLMs. In Appendices D.6 and D.10 we extend our analysis to settings with more advertisers and alternative architectures, observing similarly strong results. See Appendix D for details. We will make all of our code and set of synthetic instances publicly available.
Dataset Splits	No	The paper generates synthetic instances for testing purposes (e.g., 'We use 50 user queries and test each query on 25 different random seeds, resulting in 1,250 instances.') and does not describe traditional training/test/validation dataset splits for a pre-existing dataset.
Hardware Specification	Yes	All experiments were conducted on a compute cluster running Ubuntu 20.04.6 LTS with AMD EPYC processors with 48 cores and 1512GB RAM and Nvidia A100 GPUs and Python 3.12.1.
Software Dependencies	Yes	All experiments were conducted on a compute cluster running Ubuntu 20.04.6 LTS with AMD EPYC processors with 48 cores and 1512GB RAM and Nvidia A100 GPUs and Python 3.12.1.
Experiment Setup	Yes	Following Rafailov et al. [2023], the advertisers' reward functions are defined as $r_i(x, y) = \log \frac{\pi_i(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$. For the auctioneer's objective, we set $\tau = 1$ in Equation (1), balancing advertisers' rewards and divergence from the reference LLM. We use Llama-2-7b-chat-hf [Touvron et al., 2023] as the base architecture for all LLMs. Following Li et al. [2024], Rozière et al. [2024] we sample from all LLMs using a temperature of 0.8 and top-p 0.95. We use 50 user queries and test each query on 25 different random seeds, resulting in 1,250 instances.
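
As a rough illustration of the quantities named in the Pseudocode and Experiment Setup rows, the Python sketch below computes the log-ratio reward, the aggregate reward, and a sampling step over M candidate replies. Equation (1) is not reproduced on this page, so the sketch assumes the standard KL-regularized target, proportional to πref(y|x) · exp(r(x, y)/τ), approximated with self-normalized importance weights over candidates drawn from πgen. The function names, the dictionary layout, and the toy log-probabilities are hypothetical and are not taken from the paper's code.

```python
import math
import random


def log_ratio_reward(logp_advertiser, logp_ref):
    """r_i(x, y) = log pi_i(y|x) - log pi_ref(y|x), the log-ratio reward."""
    return logp_advertiser - logp_ref


def aggregate_reward(logps_advertisers, logp_ref):
    """r(x, y) = sum over advertisers of r_i(x, y)."""
    return sum(log_ratio_reward(lp, logp_ref) for lp in logps_advertisers)


def candidate_weights(candidates, tau=1.0):
    """Self-normalized weights over candidate replies.

    ASSUMPTION: the target is proportional to pi_ref(y|x) * exp(r(x, y) / tau)
    and candidates were drawn from pi_gen, so each weight is proportional to
    (pi_ref / pi_gen) * exp(r / tau). This form is assumed, not quoted.
    """
    log_w = [
        c["logp_ref"] - c["logp_gen"]
        + aggregate_reward(c["logp_advertisers"], c["logp_ref"]) / tau
        for c in candidates
    ]
    shift = max(log_w)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(v - shift) for v in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def sample_reply(candidates, tau=1.0, rng=None):
    """Draw one candidate reply according to the weights above."""
    rng = rng or random.Random(0)
    weights = candidate_weights(candidates, tau)
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy log-probabilities for M = 2 candidates and two advertisers.
    toy_candidates = [
        {"logp_ref": -10.0, "logp_gen": -10.0, "logp_advertisers": [-9.0, -10.0]},
        {"logp_ref": -12.0, "logp_gen": -12.0, "logp_advertisers": [-12.0, -12.0]},
    ]
    print(candidate_weights(toy_candidates, tau=1.0))
    print(sample_reply(toy_candidates, tau=1.0))
```

With τ = 1, as in the reported setup, a candidate's weight grows with the aggregate log-ratio reward; a larger τ would flatten the reward term and leave the weights closer to the πref/πgen ratio alone.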