Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Collusion Detection and Ground Truth Inference in Crowdsourcing for Labeling Tasks
Authors: Changyue Song, Kaibo Liu, Xi Zhang
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical studies using synthetic and real data sets are also conducted to verify the performance of the method. |
| Researcher Affiliation | Academia | Changyue Song EMAIL School of Systems and Enterprises Stevens Institute of Technology Hoboken, NJ 07030, USA; Kaibo Liu EMAIL Department of Industrial and Systems Engineering University of Wisconsin-Madison Madison, WI 53706, USA; Xi Zhang EMAIL Department of Industrial Engineering and Management Peking University Beijing, 100871, China |
| Pseudocode | Yes | To tackle this issue, we propose to adopt a coordinate descent (CD) algorithm as follows. Step 1: Update H_{i,j} by maximizing f(θ) with a_i and m fixed, i.e., H^{k+1} = argmax_H f(θ \| a_i = a_i^k, m = m^k), where H^{k+1} = {H_{i,j}^{k+1}, (i, j) ∈ P}. Step 2: Update a_i and m by maximizing f(θ) with H_{i,j} fixed, i.e., (a_i^{k+1}, m^{k+1}) = argmax_{a_i, m} f(θ \| H = H^{k+1}). |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | In addition, we implement PROCAP to five publicly available data sets including bluebird, ducks, tweets, stage2, and rating. (1) The bluebird data set consists of worker-generated labels indicating whether an image contains Indigo Bunting or Blue Grosbeak (Welinder et al., 2010); (2) In the ducks data set, workers are presented with photos that may contain American Black Duck, Canada Goose, Mallard, Red-necked Grebe, or no bird, and need to identify whether the photo contains a duck or not (Welinder et al., 2010); (3) In the tweets data set, workers classify the sentiment of tweets as positive or negative (Mozafari et al., 2014); (4) In the stage2 data set, workers judge whether a document is related to a topic for document-topic pairs (Tang and Lease, 2011) — this data set was part of the TREC 2011 crowdsourcing track; (5) The rating data set consists of ratings on a scale of 1 to 10 for products, and the collusive behaviors of workers are identified by obtaining the admission of colluding workers (Khuda Bukhsh et al., 2014). |
| Dataset Splits | No | For the real data sets, the paper states: "All available worker-generated labels are used to estimate the ground truth, and the tasks with ground true labels available are used to calculate the accuracy of the inference." This describes an evaluation strategy but does not specify train/test/validation splits for reproducing the model's training process. |
| Hardware Specification | Yes | The numerical studies were conducted on a virtual machine with an Intel Xeon E5-2693V3 16-core 2.30-GHz processor and 32 GB RAM. |
| Software Dependencies | No | The paper mentions statistical methods and algorithms like adaptive LASSO, EM algorithm, and coordinate descent, but does not specify any programming languages, libraries, or software packages with version numbers. |
| Experiment Setup | Yes | The ground true labels yt for each task are randomly generated with the marginal probability m = [0.6, 0.4]^T. In addition, we consider 10 workers. If working independently, each worker has a confusion matrix of a = [0.7, 0.3; 0.3, 0.7]. The first k workers belong to a colluding group with a colluding probability of h for each task. Specifically, with a probability of h, the k workers collude on a task and generate the same label according to a confusion matrix of b = [ρ, 1-ρ; 1-ρ, ρ]; otherwise they generate the labels independently according to their own confusion matrices. ... Specifically, we consider scenarios with k = 3 and k = 5... and we consider ρ = 0.7, 0.5, 0.3, and 0... In each scenario, we consider two different colluding probabilities including h = 0.5 and h = 1... For each scenario with a certain number of tasks, we replicate the simulation for 100 times. ... we initialize a_i and m in the same way when implementing the algorithms proposed in Section 5. For H_{i,j}, we initialize H_{i,j}^0 = 0.5. |
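The two-step coordinate descent scheme quoted in the Pseudocode row alternates closed-form block updates: one block (H) is maximized with the other block (a_i, m) held fixed, then the roles swap. The sketch below illustrates that alternation on a toy smooth objective; the objective, variable names, and closed-form updates are stand-in assumptions, not the paper's actual f(θ):

```python
def coordinate_descent(x0, y0, iters=50):
    """Two-block coordinate ascent, mirroring the paper's CD structure.

    Here x plays the role of H (step 1) and y plays the role of (a_i, m)
    (step 2). Toy objective (an assumption, not the paper's f(theta)):
        f(x, y) = -(x - y)**2 - (x - 3)**2 - (y - 1)**2
    Each block update is the closed-form argmax obtained by setting the
    corresponding partial derivative to zero.
    """
    x, y = x0, y0
    for _ in range(iters):
        x = (y + 3) / 2   # step 1: argmax over x with y fixed
        y = (x + 1) / 2   # step 2: argmax over y with x fixed
    return x, y

x, y = coordinate_descent(0.0, 0.0)
# Iterates converge to the joint maximizer (7/3, 5/3) of the toy objective.
```

Each cycle is a contraction here, so a few dozen iterations suffice; in the paper's setting each block update would instead maximize the likelihood-based f(θ) over H or over (a_i, m).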
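The Experiment Setup row describes a fully specified generative process for the synthetic studies: true labels drawn from m = [0.6, 0.4], 10 workers with confusion matrix a, and the first k workers colluding with probability h using confusion matrix b. A minimal simulation sketch of that process (function name, seeding, and 0/1 label encoding are my assumptions):

```python
import random

def simulate(n_tasks=1000, n_workers=10, k=3, h=0.5, rho=0.7, seed=0):
    """Generate worker labels under the colluding-group setup above.

    Returns (truth, labels): truth[t] is task t's true label (0 or 1),
    labels[t][w] is worker w's label for task t.
    """
    rng = random.Random(seed)
    m0 = 0.6           # marginal P(y = 0); m = [0.6, 0.4]
    a_diag = 0.7       # independent worker: P(label = truth), from a
    truth, labels = [], []
    for _ in range(n_tasks):
        y = 0 if rng.random() < m0 else 1
        truth.append(y)
        row = [0] * n_workers
        if rng.random() < h:
            # The first k workers collude: one shared label drawn from b.
            shared = y if rng.random() < rho else 1 - y
            for w in range(k):
                row[w] = shared
        else:
            for w in range(k):
                row[w] = y if rng.random() < a_diag else 1 - y
        # Remaining workers always label independently via a.
        for w in range(k, n_workers):
            row[w] = y if rng.random() < a_diag else 1 - y
        labels.append(row)
    return truth, labels

truth, labels = simulate()
```

Replicating this 100 times per (k, ρ, h) scenario, as the paper describes, would reproduce the inputs to its accuracy comparisons.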