Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Collusion Detection and Ground Truth Inference in Crowdsourcing for Labeling Tasks
Authors: Changyue Song, Kaibo Liu, Xi Zhang
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical studies using synthetic and real data sets are also conducted to verify the performance of the method. |
| Researcher Affiliation | Academia | Changyue Song EMAIL School of Systems and Enterprises Stevens Institute of Technology Hoboken, NJ 07030, USA; Kaibo Liu EMAIL Department of Industrial and Systems Engineering University of Wisconsin-Madison Madison, WI 53706, USA; Xi Zhang EMAIL Department of Industrial Engineering and Management Peking University Beijing, 100871, China |
| Pseudocode | Yes | To tackle this issue, we propose to adopt a coordinate descent (CD) algorithm as follows. Step 1: Update H_{i,j} by maximizing f(θ) with a_i and m fixed, i.e., H^{k+1} = argmax_H f(θ \| a_i = a_i^k, m = m^k), where H^{k+1} = {H_{i,j}^{k+1}, (i, j) ∈ P}. Step 2: Update a_i and m by maximizing f(θ) with H_{i,j} fixed, i.e., (a_i^{k+1}, m^{k+1}) = argmax_{a_i, m} f(θ \| H = H^{k+1}). |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | In addition, we implement PROCAP to five publicly available data sets including bluebird, ducks, tweets, stage2, and rating. (1) The bluebird data set consists of worker-generated labels indicating whether an image contains Indigo Bunting or Blue Grosbeak (Welinder et al., 2010); (2) In the ducks data set, workers are presented with photos that may contain American Black Duck, Canada Goose, Mallard, Red-necked Grebe, or no bird, and need to identify whether the photo contains a duck or not (Welinder et al., 2010); (3) In the tweets data set, workers classify the sentiment of tweets as positive or negative (Mozafari et al., 2014); (4) In the stage2 data set, workers judge whether a document is related to a topic for document-topic pairs (Tang and Lease, 2011) — this data set was part of the TREC 2011 crowdsourcing track; (5) The rating data set consists of ratings on a scale of 1 to 10 for products, and the collusive behaviors of workers are identified by obtaining the admission of colluding workers (Khuda Bukhsh et al., 2014). |
| Dataset Splits | No | For the real data sets, the paper states: "All available worker-generated labels are used to estimate the ground truth, and the tasks with ground true labels available are used to calculate the accuracy of the inference." This describes an evaluation strategy but does not specify train/test/validation splits for reproducing the model's training process. |
| Hardware Specification | Yes | The numerical studies were conducted on a virtual machine with an Intel Xeon E5-2693V3 16-core 2.30-GHz processor and 32 GB RAM. |
| Software Dependencies | No | The paper mentions statistical methods and algorithms like adaptive LASSO, EM algorithm, and coordinate descent, but does not specify any programming languages, libraries, or software packages with version numbers. |
| Experiment Setup | Yes | The ground true labels yt for each task are randomly generated with the marginal probability m = [0.6, 0.4]^T. In addition, we consider 10 workers. If working independently, each worker has a confusion matrix of a = [0.7, 0.3; 0.3, 0.7]. The first k workers belong to a colluding group with a colluding probability of h for each task. Specifically, with a probability of h, the k workers collude on a task and generate the same label according to a confusion matrix of b = [ρ, 1-ρ; 1-ρ, ρ]; otherwise they generate the labels independently according to their own confusion matrices. ... Specifically, we consider scenarios with k = 3 and k = 5... and we consider ρ = 0.7, 0.5, 0.3, and 0... In each scenario, we consider two different colluding probabilities including h = 0.5 and h = 1... For each scenario with a certain number of tasks, we replicate the simulation for 100 times. ... we initialize a_i and m in the same way when implementing the algorithms proposed in Section 5. For H_{i,j}, we initialize H_{i,j}^0 = 0.5. |
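The two-step coordinate descent scheme quoted in the Pseudocode row alternates closed-form block updates: one block (H) is maximized with the other block (a_i, m) held fixed, then the roles swap. The sketch below illustrates that alternation on a toy smooth objective; the objective, variable names, and closed-form updates are stand-in assumptions, not the paper's actual f(θ):

```python
def coordinate_descent(x0, y0, iters=50):
    """Two-block coordinate ascent, mirroring the paper's CD structure.

    Here x plays the role of H (step 1) and y plays the role of (a_i, m)
    (step 2). Toy objective (an assumption, not the paper's f(theta)):
        f(x, y) = -(x - y)**2 - (x - 3)**2 - (y - 1)**2
    Each block update is the closed-form argmax obtained by setting the
    corresponding partial derivative to zero.
    """
    x, y = x0, y0
    for _ in range(iters):
        x = (y + 3) / 2   # step 1: argmax over x with y fixed
        y = (x + 1) / 2   # step 2: argmax over y with x fixed
    return x, y

x, y = coordinate_descent(0.0, 0.0)
# Iterates converge to the joint maximizer (7/3, 5/3) of the toy objective.
```

Each cycle is a contraction here, so a few dozen iterations suffice; in the paper's setting each block update would instead maximize the likelihood-based f(θ) over H or over (a_i, m).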
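The Experiment Setup row describes a fully specified generative process for the synthetic studies: true labels drawn from m = [0.6, 0.4], 10 workers with confusion matrix a, and the first k workers colluding with probability h using confusion matrix b. A minimal simulation sketch of that process (function name, seeding, and 0/1 label encoding are my assumptions):

```python
import random

def simulate(n_tasks=1000, n_workers=10, k=3, h=0.5, rho=0.7, seed=0):
    """Generate worker labels under the colluding-group setup above.

    Returns (truth, labels): truth[t] is task t's true label (0 or 1),
    labels[t][w] is worker w's label for task t.
    """
    rng = random.Random(seed)
    m0 = 0.6           # marginal P(y = 0); m = [0.6, 0.4]
    a_diag = 0.7       # independent worker: P(label = truth), from a
    truth, labels = [], []
    for _ in range(n_tasks):
        y = 0 if rng.random() < m0 else 1
        truth.append(y)
        row = [0] * n_workers
        if rng.random() < h:
            # The first k workers collude: one shared label drawn from b.
            shared = y if rng.random() < rho else 1 - y
            for w in range(k):
                row[w] = shared
        else:
            for w in range(k):
                row[w] = y if rng.random() < a_diag else 1 - y
        # Remaining workers always label independently via a.
        for w in range(k, n_workers):
            row[w] = y if rng.random() < a_diag else 1 - y
        labels.append(row)
    return truth, labels

truth, labels = simulate()
```

Replicating this 100 times per (k, ρ, h) scenario, as the paper describes, would reproduce the inputs to its accuracy comparisons.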