Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Identifying Unreliable and Adversarial Workers in Crowdsourced Labeling Tasks
Authors: Srikanth Jagabathula, Lakshminarayanan Subramanian, Ashwin Venkataraman
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work makes algorithmic, theoretical, and empirical contributions: Theoretically, we show that our algorithms successfully identify unreliable honest workers, workers adopting deterministic strategies, and worst-case sophisticated adversaries... Empirically, we show that filtering out outliers using our algorithms can significantly improve the accuracy of several state-of-the-art label aggregation algorithms in real-world crowdsourcing datasets. We conducted two numerical studies to demonstrate the practical value of our methods. |
| Researcher Affiliation | Academia | Srikanth Jagabathula EMAIL Department of Information, Operations, and Management Sciences Leonard N. Stern School of Business New York University, NY 10012, USA. Lakshminarayanan Subramanian EMAIL Department of Computer Science Courant Institute of Mathematical Sciences New York University, NY 10012, USA. Ashwin Venkataraman EMAIL Department of Computer Science Courant Institute of Mathematical Sciences New York University, NY 10012, USA |
| Pseudocode | Yes | Algorithm 1 (soft-penalty), Algorithm 2 (hard-penalty), and Algorithm 3 (penalty-based label aggregation) |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its source code or a link to a code repository. It mentions using a third-party library, 'Python networkx library', but this is not the authors' own implementation code. |
| Open Datasets | Yes | We focused on the following standard datasets: stage2 and task2: consisting of a collection of topic-document pairs labeled as relevant or non-relevant by workers on Amazon Mechanical Turk (see Tang and Lease, 2011). rte and temp: consisting of annotations by Amazon Mechanical Turk workers for different natural language processing (NLP) tasks... (see Snow et al., 2008). tweets: consisting of sentiment (positive or negative) labels for 1000 tweets (see Mozafari et al., 2014). |
| Dataset Splits | No | The paper describes the generation of synthetic data in Section 5.2, including parameters for worker honesty probability (q = 0.7), task prevalence (γ = 0.5), and worker reliability distribution (µw drawn u.a.r. from [0.8, 1.0)). However, it does not explicitly provide details about training/test/validation splits for either the synthetic data or the real-world datasets used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using the 'Python networkx library' in Section 5.2 but does not provide a specific version number for this library or for Python itself. |
| Experiment Setup | Yes | Section 5.2, 'Setup of study', details parameters for the simulation, such as 'n = 100 workers', 'probability q that a worker is honest was set to 0.7', 'prevalence γ of +1 tasks was set to 0.5', 'worker degrees according to a power-law distribution (with exponent a = 2.5) with the minimum degree equal to 5', and 'reliability µw u.a.r from the interval [0.8, 1.0)'. Additionally, Section 5.1 states 'We chose kmax = 100 in our experiments' for the KOS algorithm and describes an iterative worker removal process. |
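The simulation parameters reported in Section 5.2 (n = 100 workers, honesty probability q = 0.7, prevalence γ = 0.5, power-law worker degrees with exponent a = 2.5 and minimum degree 5, reliabilities µw drawn u.a.r. from [0.8, 1.0)) can be turned into a minimal data-generation sketch. This is an illustrative reconstruction, not the authors' code: the number of tasks, the inverse-CDF degree sampler, and the simple label-flipping adversary are all assumptions, and the paper studies richer adversary models.

```python
import random

random.seed(0)

n_workers = 100   # n = 100 workers (from Section 5.2)
n_tasks = 100     # assumed; the table does not state the task count
q = 0.7           # probability a worker is honest
gamma = 0.5       # prevalence of +1 tasks
a = 2.5           # power-law exponent for worker degrees
d_min = 5         # minimum worker degree

# True +/-1 task labels with prevalence gamma of +1
true_labels = [1 if random.random() < gamma else -1 for _ in range(n_tasks)]

# Worker honesty flags and reliabilities mu_w drawn u.a.r. from [0.8, 1.0)
honest = [random.random() < q for _ in range(n_workers)]
reliability = [random.uniform(0.8, 1.0) for _ in range(n_workers)]

def power_law_degree():
    # Inverse-CDF sampling of a power law with exponent a, truncated
    # below at d_min and above at n_tasks (one common way; assumption)
    u = random.random()
    return min(n_tasks, int(d_min * (1.0 - u) ** (-1.0 / (a - 1.0))))

degrees = [power_law_degree() for _ in range(n_workers)]

# Each worker labels `degree` distinct random tasks. Honest workers answer
# correctly with probability mu_w; here adversaries simply flip their
# answer -- one deterministic strategy, used only for illustration.
labels = {}
for w in range(n_workers):
    for t in random.sample(range(n_tasks), degrees[w]):
        ans = true_labels[t] if random.random() < reliability[w] else -true_labels[t]
        labels[(w, t)] = ans if honest[w] else -ans
```

The resulting `labels` dictionary maps (worker, task) pairs to ±1 responses, which is the input shape that label aggregation algorithms such as majority voting or KOS typically consume.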