Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Novel Strategy for Active Task Assignment in Crowd Labeling

Authors: Zehong Hu, Jie Zhang

IJCAI 2018

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments based on four popular worker models and four MTurk datasets. The empirical results show that our strategy not only requires the least labels for high label accuracy but also achieves the highest computation efficiency among all existing prediction-based strategies.
Researcher Affiliation	Collaboration	Zehong Hu, Jie Zhang Rolls-Royce@NTU Corporate Lab, School of Computer Science and Engineering Nanyang Technological University, Singapore EMAIL
Pseudocode	Yes	Algorithm 1: Active Task Assignment
Open Source Code	No	The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets	Yes	We employ four popular MTurk datasets as our testbeds in Figure 2. HCB dataset contains the judgments on the relevance between Web pages and search queries [Buckley et al., 2010]. RTE dataset workers need to check whether a hypothesis sentence can be inferred from the provided sentence [Snow et al., 2008]. SPE dataset consists of the positive or negative labels of movie reviews [Pang and Lee, 2005]. ACC dataset workers classify websites according to their adult contents [Ipeirotis et al., 2010].
Dataset Splits	No	The paper describes collecting 'T labels' and uses 'accuracy A(t)' for evaluation but does not specify explicit training/validation/test dataset splits or their percentages.
Hardware Specification	Yes	The time cost is estimated via running 100 rounds of experiments on Xeon CPU E5-1650 and collecting MN labels in each round of experiments.
Software Dependencies	Yes	Thus, we employ the famous SQUARE library (Version 2.0) to complement those nonexistent labels [Sheshadri and Lease, 2013].
Experiment Setup	Yes	In Figures 1a-d, we set the numbers of tasks and workers as 500 and 10, respectively. In Figure 3, we compare different settings of the parameters, γ(t), a and b, in our strategy. In Figure 3a, we employ the function family γ(t) = 1 − exp(−c t) to test the effects of changing risk level functions γ(t) on label accuracy.
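The Experiment Setup row varies a risk-level function family. Reading that family as γ(t) = 1 − exp(−c t) with c > 0 (the minus signs appear to have been lost in extraction), the minimal Python sketch below evaluates it for a few values of c. The function name `risk_level` and the chosen values of c and t are hypothetical illustrations, not taken from the paper, and the sketch does not reproduce the authors' assignment strategy.

```python
import math


def risk_level(t: float, c: float) -> float:
    """Evaluate gamma(t) = 1 - exp(-c * t), assuming c > 0 (hypothetical helper)."""
    if c <= 0:
        raise ValueError("c must be positive")
    return 1.0 - math.exp(-c * t)


# Hypothetical c and t values, chosen only to show the shape of the family:
# gamma starts at 0 when t = 0 and rises toward 1, faster for larger c.
for c in (0.01, 0.1, 1.0):
    values = [round(risk_level(t, c), 3) for t in (0, 10, 100)]
    print(f"c={c}: {values}")
```

Under this reading, c controls how quickly the risk level approaches 1, which is the kind of parameter a sweep like the one described for Figure 3a would vary.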