Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Evaluating LLM-contaminated Crowdsourcing Data Without Ground Truth

Authors: Yichi Zhang, Jinlong Pang, Zhaowei Zhu, Yang Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we test our assumptions using two subjective labeling datasets and five types of commonly used LLMs. Our findings show that human responses consistently exhibit stronger correlations than LLM-generated responses when samples of Z and Zi originate from the same model, supporting the validity of Assumption 4.5. However, the correlations between independent samples of LLM responses are usually non-zero, suggesting that Assumptions 4.1 and 4.2 are too strong to hold in practice. Finally, under scenarios where Assumption 4.5 holds, we evaluate the effectiveness of our method in detecting low-effort agents. Echoing our theoretical insights, we show that our approach has the most robust performance on mixed crowd compared with all the baselines. (Section 1) ... We evaluate the effectiveness of our proposed method using real-world crowdsourcing datasets. ... We focus on identifying low-effort agents with lazy-reporting strategies and comparing various methods based on their area under the ROC curve (AUC). (Section 5)
Researcher Affiliation	Academia	Yichi Zhang DIMACS, Rutgers University EMAIL Jinlong Pang University of California, Santa Cruz EMAIL Zhaowei Zhu University of California, Santa Cruz EMAIL Yang Liu University of California, Santa Cruz EMAIL
Pseudocode	Yes	Algorithm 1: The Correlated Agreement Mechanism [Shnayder et al., 2016] ... Algorithm 2: The conditioned CA mechanism
Open Source Code	Yes	The datasets and code of our experiments are available at https://github.com/yichiz97/LLM_contamination. (Section 5) ... Justification: We provide the code for our experiments in the supplementary material. (NeurIPS Paper Checklist, Question 5)
Open Datasets	Yes	Hatefulness/Offensiveness/Toxicity Labeling We first consider a toxicity labeling dataset which includes 3459 social media user comments posted in response to political news posts and videos on Twitter, YouTube, and Reddit in August 2021 [Schöpke-Gonzalez et al., 2025]. ... Preference Alignment We further use an alignment dataset where each question compares two LLM-generated answers to the same prompt. ... For more details, refer to Miranda et al. [2024]. (Section 5.1)
Dataset Splits	Yes	Specifically, we replace a randomly selected fraction of agents in the original dataset with simulated low-effort agents adopting lazy-reporting strategies. ... An αllm fraction are LLM-reliant agents, who report the LLM-generated labels on all assigned questions. An αr fraction are random agents, who generate labels by independently sampling from the marginal distribution over labels for each assigned question. An αb fraction are biased agents, who report the dataset's majority label on 90% of questions and choose uniformly at random on the rest. (Section 5.3.1) ... For each value of αllm ranging from 0 to 0.2, we randomly sample αr and αb uniformly from [0, 0.2] and report the average and error bars over 50 such trials. (Section 5.3.3)
Hardware Specification	Yes	Our paper does not include extensive GPU training. All experiments are lightweight and can be efficiently run on local machines, such as a standard MacBook. (NeurIPS Paper Checklist, Question 8)
Software Dependencies	Yes	In this work, we select five well-known LLMs to generate labels or annotations for all used datasets, including GPT-3.5-turbo, GPT-4, Gemma-2-2b-it, Phi-3.5-mini-Instruct, and Mistral-7B-Instruct-v0.3. (Appendix G.1)
Experiment Setup	Yes	Specifically, we replace a randomly selected fraction of agents in the original dataset with simulated low-effort agents adopting lazy-reporting strategies. ... An αllm fraction are LLM-reliant agents, who report the LLM-generated labels on all assigned questions. An αr fraction are random agents, who generate labels by independently sampling from the marginal distribution over labels for each assigned question. An αb fraction are biased agents, who report the dataset's majority label on 90% of questions and choose uniformly at random on the rest. (Section 5.3.1) ... For each value of αllm ranging from 0 to 0.2, we randomly sample αr and αb uniformly from [0, 0.2] and report the average and error bars over 50 such trials. (Section 5.3.3) ... For each considered LLM (as detailed in Appendix G.1), we prompt it to independently generate responses to each question in the dataset three times using the default temperature. (Section 5.2)
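
Illustrative sketch (not the authors' code): the agent-replacement procedure quoted under Experiment Setup could be approximated as follows. The function name `simulate_lazy_agents`, the dictionary-based data layout, the rounding of group sizes, and the choice to estimate the marginal distribution and majority label from the original reports are all assumptions made for illustration; the repository linked under Open Source Code is the authoritative implementation.

```python
import random
from collections import Counter


def simulate_lazy_agents(reports, llm_labels, labels,
                         alpha_llm, alpha_r, alpha_b, seed=0):
    """Replace a random subset of agents with simulated low-effort agents.

    reports:    {agent_id: {question_id: label}} (hypothetical layout)
    llm_labels: {question_id: label} produced by one LLM
    labels:     list of all possible labels
    Returns (new_reports, kinds), where kinds maps each replaced agent
    to "llm", "random", or "biased".
    """
    rng = random.Random(seed)
    agents = list(reports)
    n = len(agents)

    # Assumption: marginal distribution and majority label are estimated
    # from the original (pre-replacement) reports.
    counts = Counter(lab for answers in reports.values()
                     for lab in answers.values())
    weights = [counts[lab] for lab in labels]
    majority = max(labels, key=lambda lab: counts[lab])

    # Assumption: group sizes are the rounded fractions of the crowd.
    sizes = {"llm": round(alpha_llm * n),
             "random": round(alpha_r * n),
             "biased": round(alpha_b * n)}
    chosen = rng.sample(agents, sum(sizes.values()))

    kinds, start = {}, 0
    for kind, size in sizes.items():
        for agent in chosen[start:start + size]:
            kinds[agent] = kind
        start += size

    new_reports = {}
    for agent, answers in reports.items():
        kind = kinds.get(agent)
        if kind is None:
            # Untouched agent: keep the original reports.
            new_reports[agent] = dict(answers)
        elif kind == "llm":
            # LLM-reliant: report the LLM label on every assigned question.
            new_reports[agent] = {q: llm_labels[q] for q in answers}
        elif kind == "random":
            # Random: sample each label from the marginal distribution.
            new_reports[agent] = {
                q: rng.choices(labels, weights=weights)[0] for q in answers}
        else:
            # Biased: majority label with probability 0.9, else uniform.
            new_reports[agent] = {
                q: majority if rng.random() < 0.9 else rng.choice(labels)
                for q in answers}
    return new_reports, kinds
```

Under these assumptions, one trial of the quoted protocol would fix `alpha_llm`, draw `alpha_r` and `alpha_b` uniformly from [0, 0.2], call the function, and score a detection method by its AUC against the returned `kinds`.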