Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Toward a Robust and Universal Crowd-Labeling Framework
Authors: Faiza Khan Khattak
IJCAI 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that our approaches are robust even in the presence of a large proportion of low-quality labelers in the crowd (Figure 1). Furthermore, we derive a lower bound of the number of expert labels needed [Khattak and Salleb-Aouissi, 2013]. Figure 1: UCI Chess Dataset [Asuncion and Newman, 2007]. Accuracy of Majority voting, GLAD (with and without clamping) [Whitehill et al., 2009], Majority voting, Dawid and Skene [Dawid and Skene, 1979], EM (Expectation Maximization), Karger's iterative method [Karger et al., 2014], Mean Field algorithm and BP [Liu et al., 2012] and ELICE (all versions and variants) with 20 expert-labeled instances. Good labelers: 0-35% mistakes, Random labelers: 35-65% mistakes, Malicious labelers: 65-100% mistakes. Accuracy vs. percentage of random and malicious labelers averaged over 50 runs. |
| Researcher Affiliation | Academia | Faiza Khan Khattak Columbia University, New York EMAIL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any information or links regarding the availability of open-source code for the described methodology. |
| Open Datasets | Yes | Figure 1: UCI Chess Dataset [Asuncion and Newman, 2007]. |
| Dataset Splits | No | The paper mentions using "expert-labeled instances (ground truth) for a small percentage of data to learn the parameters" (e.g., 0.1%-10% of the dataset, or 20 instances for Figure 1), but it does not specify a general train/validation/test dataset split for the entire dataset used in evaluation, such as specific percentages or sample counts for validation or training sets. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers or other ancillary software details. |
| Experiment Setup | No | The paper describes the proposed methods and how parameters are estimated but does not provide specific experimental setup details such as hyperparameters (e.g., learning rates, batch sizes) or training configurations. |
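
The Figure 1 caption quoted in the Research Type row defines three labeler categories by their mistake rates and lists majority voting among the baselines. The sketch below illustrates that setup only; it is not the paper's ELICE method and does not reproduce its results. Binary labels, a per-labeler error rate drawn uniformly from the category's range, tie-breaking toward 1, and all function names are assumptions made for illustration.

```python
import random

# Mistake-rate ranges per labeler category, as quoted in the Figure 1 caption.
ERROR_RANGES = {
    "good": (0.00, 0.35),
    "random": (0.35, 0.65),
    "malicious": (0.65, 1.00),
}


def simulate_labels(truth, categories, rng):
    """Return one list of binary labels per labeler.

    Assumption: each labeler has a fixed error rate drawn uniformly from
    its category's range and flips each true label with that probability.
    """
    all_labels = []
    for category in categories:
        low, high = ERROR_RANGES[category]
        error_rate = rng.uniform(low, high)
        labels = [1 - y if rng.random() < error_rate else y for y in truth]
        all_labels.append(labels)
    return all_labels


def majority_vote(all_labels):
    """Aggregate binary labels per instance; ties resolve to 1 (assumption)."""
    n_labelers = len(all_labels)
    return [1 if 2 * sum(votes) >= n_labelers else 0 for votes in zip(*all_labels)]


def accuracy(predicted, truth):
    """Fraction of instances where the aggregated label matches the truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical crowd: 7 good, 2 random, and 1 malicious labeler on 200 instances.
rng = random.Random(0)
truth = [rng.randint(0, 1) for _ in range(200)]
crowd = ["good"] * 7 + ["random"] * 2 + ["malicious"] * 1
print(accuracy(majority_vote(simulate_labels(truth, crowd, rng)), truth))
```

Raising the share of `"random"` and `"malicious"` entries in `crowd` gives a rough feel for the accuracy-versus-labeler-quality axis that the caption describes, under the assumptions stated above.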