Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Stochastically Dominant Peer Prediction

Authors: Yichi Zhang, Shengwei Xu, Grant Schoenebeck, David M. Pennock

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we empirically compare the sensitivity of the proposed mechanisms across various settings. The results support our theoretical findings: EA consistently exhibits the highest sensitivity among SD-truthful mechanisms in the binary-signal setting (except when the signal prior is nearly uniform). ... We use two real-world datasets to estimate the information structure between two agents.
Researcher Affiliation	Academia	Yichi Zhang (DIMACS, Rutgers University); Shengwei Xu (University of Michigan, Ann Arbor); David Pennock (DIMACS, Rutgers University); Grant Schoenebeck (University of Michigan, Ann Arbor)
Pseudocode	No	The paper describes the mechanisms and their working principles using mathematical formulations and descriptive text, but it does not include explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code	Yes	All codes for our experiments are provided in https://github.com/DavidXu999/Stochastically-Dominant-Peer-Prediction.
Open Datasets	Yes	We use two real-world datasets to estimate the information structure between two agents. The first dataset contains binary labels classifying whether a compound is appropriate or inappropriate to be synthesized [Baba et al., 2018]. The second dataset collects the annotations of the sentiment of 300 tweets, where the size of the signal space is 4 [Venanzi et al., 2015].
Dataset Splits	No	The paper mentions using two real-world datasets (J1 and J2) and describes their characteristics but does not provide specific details on how these datasets were split into training, testing, or validation sets for experiments.
Hardware Specification	No	Our experiments do not involve dense computing, and can be conducted on laptops.
Software Dependencies	No	The paper describes the methodologies and experiments but does not explicitly state specific version numbers for any software dependencies or libraries used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	In particular, we simulate reports to each of the n questions with both agents exerting full effort e = 1 and run M to compute the scores. Repeating this process for T = 20,000 times yields T i.i.d. samples of SM(e). We then compute the score of Alice when she deviates to a lower effort level at e' = 0.8, and obtain T samples of SM(e').
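
The sampling procedure quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. The only values taken from the quoted text are the two effort levels (e = 1 and e' = 0.8) and T = 20,000. Everything else is hypothetical: a simple output-agreement score stands in for the mechanism M, and the binary signal with a uniform prior, the signal accuracy of 0.8, the number of questions n = 20, and the modeling of low effort as a uniformly random report with probability 1 - e are all assumed for illustration.

```python
import random
import statistics


def simulate_scores(effort, n_questions=20, T=20_000, accuracy=0.8, seed=0):
    """Return T i.i.d. samples of Alice's score at the given effort level.

    Assumptions (not from the paper): binary signals, uniform prior,
    output agreement as a stand-in for the mechanism M, and Bob always
    exerting full effort.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(T):
        agreements = 0
        for _ in range(n_questions):
            truth = rng.randrange(2)
            # Bob exerts full effort: his report matches the truth
            # with probability `accuracy`.
            bob = truth if rng.random() < accuracy else 1 - truth
            if rng.random() < effort:
                # Alice exerts effort on this question and sees a noisy signal.
                alice = truth if rng.random() < accuracy else 1 - truth
            else:
                # Alice shirks and reports a uniformly random bit.
                alice = rng.randrange(2)
            agreements += (alice == bob)
        # Score for this run: fraction of questions where the reports agree.
        samples.append(agreements / n_questions)
    return samples


full_effort = simulate_scores(effort=1.0)   # samples of the score at e = 1
low_effort = simulate_scores(effort=0.8)    # samples of the score at e' = 0.8

# Average score lost by deviating to the lower effort level.
mean_gap = statistics.mean(full_effort) - statistics.mean(low_effort)
print(f"mean score gap between e = 1 and e' = 0.8: {mean_gap:.4f}")
```

Comparing the two sample sets, by their means as here or by their full empirical distributions, indicates how strongly a given scoring rule penalizes reduced effort. Any actual mechanism from the paper would replace the agreement score inside the inner loop.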