Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Cramér–von Mises Approach to Incentivizing Truthful Data Sharing

Authors: Alex Clinton, Thomas Zeng, Yiding Chen, Xiaojin Zhu, Kirthevasan Kandasamy

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data. In Section 5, we empirically evaluate our methods on simulations and on real-world image and language experiments.
Researcher Affiliation	Academia	Alex Clinton University of Wisconsin-Madison EMAIL Thomas Zeng University of Wisconsin-Madison EMAIL Yiding Chen Cornell University EMAIL Xiaojin Zhu University of Wisconsin-Madison EMAIL Kirthevasan Kandasamy University of Wisconsin-Madison EMAIL
Pseudocode	Yes	Algorithm 1: A single variable Cramér von Mises style statistic. Algorithm 2: A feature-based Cramér von Mises style statistic. Algorithm 3: A prior free Cramér von Mises style statistic.
Open Source Code	Yes	Justification: Sufficient code to replicate the experiments is provided.
Open Datasets	Yes	Language data. Next, we evaluate our method and the above baselines on language data. For this, we use data from the SQuAD dataset [41], where each data point is a question about an article. Image data. We perform a similar experiment on image data using the Oxford Flowers-102 dataset [43], where each data point is an image of a flower.
Dataset Splits	Yes	We model the environment with m = 20 and m = 100 agents, where all agents have 2500 and 500 original data points respectively. We model the environment with m = 5 and m = 47, where all agents have roughly 1000 and 100 original data points respectively. We use the test dataset of [43], which consists of 4,612 images, to represent authentic data submitted by the other agents.
Hardware Specification	No	The paper does not explicitly state the hardware specifications used for running its experiments. It mentions various models like Llama 3.2-1B-Instruct and Segmind Stable Diffusion-1B, which are software components, but no details on the CPUs, GPUs, or memory of the machines where experiments were performed.
Software Dependencies	No	The paper mentions specific models and tools used for data generation or feature extraction (e.g., Llama 3.2-1B-Instruct [26], DistilBERT [42], Segmind Stable Diffusion-1B [25], DeiT-small-distilled [44]), but it does not list specific version numbers for these or for other general software dependencies (such as Python, PyTorch, or TensorFlow) used to implement the proposed mechanism.
Experiment Setup	Yes	We model the environment with m = 20 and m = 100 agents, where all agents have 2500 and 500 original data points respectively. We instantiate Algorithm 3 with feature maps obtained from the feature layer of the DistilBERT [42] encoder model, which corresponds to 768 features. We apply the baselines to the same set of features and take the average. For simplicity, we chose the split map ψ(n) = 0. Table 4: Parameters and prompts used for Segmind Stable Diffusion-1B to generate the fabricated images. Here cls_name is replaced with the type of flower being generated. Parameter / Value: Text Prompt: "Photorealistic photograph of a single {cls_name}, realistic colors, natural lighting, high detail, sharp focus on petals. Another unique photo of the same flower species."; Negative Prompt: "oversaturated, highly saturated, neon colors, garish colors, vibrant colors, illustration, painting, drawing, sketch, cartoon, anime, unrealistic, blurry, low quality, text, watermark, signature, border, frame, multiple flowers"; Strength: 0.7; Guidance Scale: 6; Num. Inference Steps: 50.
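
The Table 4 settings quoted in the Experiment Setup row can be collected into a single configuration object, which may help when trying to reproduce the fabricated-image generation. The sketch below is illustrative only: the function name `build_generation_config` and the dictionary keys are assumptions chosen for readability, not names from the paper or its released code, and the class name "sunflower" is a hypothetical example. Only the prompt text and the numeric values are copied from Table 4.

```python
# Minimal sketch: Table 4 parameters gathered into one dictionary.
# Key names and the function name are illustrative assumptions.

TEXT_PROMPT_TEMPLATE = (
    "Photorealistic photograph of a single {cls_name}, realistic colors, "
    "natural lighting, high detail, sharp focus on petals. "
    "Another unique photo of the same flower species."
)

NEGATIVE_PROMPT = (
    "oversaturated, highly saturated, neon colors, garish colors, "
    "vibrant colors, illustration, painting, drawing, sketch, cartoon, "
    "anime, unrealistic, blurry, low quality, text, watermark, signature, "
    "border, frame, multiple flowers"
)


def build_generation_config(cls_name: str) -> dict:
    """Return the Table 4 settings with cls_name substituted into the prompt."""
    return {
        "prompt": TEXT_PROMPT_TEMPLATE.format(cls_name=cls_name),
        "negative_prompt": NEGATIVE_PROMPT,
        "strength": 0.7,            # Table 4: Strength
        "guidance_scale": 6,        # Table 4: Guidance Scale
        "num_inference_steps": 50,  # Table 4: Num. Inference Steps
    }


if __name__ == "__main__":
    # "sunflower" is a hypothetical flower class used only for illustration.
    config = build_generation_config("sunflower")
    for key, value in config.items():
        print(f"{key}: {value}")
```

How these values are passed to Segmind Stable Diffusion-1B is not stated in the summary above, so the mapping from this dictionary to any particular library call is left open.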