Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling

Authors: Xi Chen, Qihang Lin, Dengyong Zhou

JMLR 2015

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "The experiments on both simulated and real data show that our policy achieves a higher labeling quality than other existing policies at the same budget level." |
| Researcher Affiliation | Collaboration | Xi Chen (EMAIL), Stern School of Business, New York University, New York, NY 10012, USA; Qihang Lin (EMAIL), Tippie College of Business, University of Iowa, Iowa City, IA 52242, USA; Dengyong Zhou (EMAIL), Microsoft Research, Redmond, WA 98052, USA |
| Pseudocode | Yes | Algorithm 1: Optimistic Knowledge Gradient; Algorithm 2: Optimistic Knowledge Gradient for Heterogeneous Workers |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It thanks other researchers for sharing code for the comparison methods, but does not state that the authors' own implementation is available. |
| Open Datasets | Yes | "We compare different policies on a standard real data set for recognizing textual entailment (RTE) (Section 4.3 in Snow et al., 2008)." |
| Dataset Splits | No | The paper mentions a real dataset with 800 instances but does not specify the training/test/validation splits needed for reproduction. It describes the data collection process (e.g., 10 different workers per instance) but not how the dataset was partitioned for experimentation. |
| Hardware Specification | No | The paper makes no mention of the hardware used for its experiments, such as GPU/CPU models, memory, or cloud instances. It refers only to "CPU time" in a comparison table, without providing hardware details. |
| Software Dependencies | No | The paper does not list software dependencies with version numbers. Although the experiments necessarily rely on some software implementation, no specific libraries or versions are mentioned. |
| Experiment Setup | Yes | "For each simulated experiment, we randomly generate 20 different sets of data and report the averaged accuracy. The deviations for different methods are similar and quite small and thus omitted for the purpose of better visualization and space-saving. We first simulate K = 50 instances with each θi ∼ Beta(0.5, 0.5), θi ∼ Beta(2, 2), θi ∼ Beta(2, 1) or θi ∼ Beta(4, 1). The density functions of these four Beta distributions are plotted in Figure 6. For each generating distribution of θi, we compare Opt-KG using the uniform prior Beta(1, 1) (red line) to Opt-KG with the true generating distribution as the prior (blue line). The comparison in accuracy at different budget levels (T = 2K, ..., 20K) is shown in Figure 7." |
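The simulated setup quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the single-coin worker model (a worker returns label 1 with probability θi), the stage reward h(a, b) = max(a, b)/(a + b), and the "pick the instance with the largest optimistic one-step gain" selection rule are assumptions reconstructed from the paper's description of the Opt-KG policy (Algorithm 1).

```python
import numpy as np

def h(a, b):
    # Expected accuracy of the Bayes decision under a Beta(a, b) posterior:
    # predict the majority label, which is correct w.p. max(a, b) / (a + b).
    return max(a, b) / (a + b)

def opt_kg_simulation(K=50, T_mult=20, prior=(1.0, 1.0), gen=(0.5, 0.5), seed=0):
    """Simulate one run: K instances, budget T = T_mult * K labels."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(gen[0], gen[1], size=K)      # latent soft labels
    true_labels = (theta > 0.5).astype(int)
    a = np.full(K, prior[0])                      # Beta posterior parameters
    b = np.full(K, prior[1])
    for _ in range(T_mult * K):
        # Optimistic KG: choose the instance whose better-case one-step
        # improvement in expected accuracy is largest.
        r_pos = np.array([h(a[i] + 1, b[i]) - h(a[i], b[i]) for i in range(K)])
        r_neg = np.array([h(a[i], b[i] + 1) - h(a[i], b[i]) for i in range(K)])
        i = int(np.argmax(np.maximum(r_pos, r_neg)))
        if rng.random() < theta[i]:               # worker votes 1 w.p. theta_i
            a[i] += 1
        else:
            b[i] += 1
    pred = (a > b).astype(int)                    # posterior majority vote
    return float((pred == true_labels).mean())
```

Averaging `opt_kg_simulation(...)` over 20 random seeds for each generating distribution and each budget T = 2K, ..., 20K would mirror the comparison reported in Figure 7, with `prior=(1, 1)` versus `prior=gen`.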