Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling
Authors: Xi Chen, Qihang Lin, Dengyong Zhou
JMLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments on both simulated and real data show that our policy achieves a higher labeling quality than other existing policies at the same budget level. |
| Researcher Affiliation | Collaboration | Xi Chen (Stern School of Business, New York University, New York, NY 10012, USA); Qihang Lin (Tippie College of Business, University of Iowa, Iowa City, IA 52242, USA); Dengyong Zhou (Microsoft Research, Redmond, WA 98052, USA) |
| Pseudocode | Yes | Algorithm 1: Optimistic Knowledge Gradient; Algorithm 2: Optimistic Knowledge Gradient for Heterogeneous Workers |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It thanks other researchers for sharing their code for comparison methods, but does not state that the authors' own implementation code is available. |
| Open Datasets | Yes | We compare different policies on a standard real data set for recognizing textual entailment (RTE) (Section 4.3 in Snow et al., 2008). |
| Dataset Splits | No | The paper mentions a real dataset with 800 instances but does not specify any training/test/validation splits needed for reproduction. It describes the data collection process (e.g., 10 different workers per instance) but not how the dataset was partitioned for experimentation. |
| Hardware Specification | No | The paper makes no specific mention of hardware used for running experiments, such as GPU/CPU models, memory, or cloud instances. It only refers to "CPU time" in a comparison table without providing hardware details. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. Although the experiments were presumably run with some software stack, no libraries or versions are mentioned. |
| Experiment Setup | Yes | For each simulated experiment, we randomly generate 20 different sets of data and report the averaged accuracy. The deviations for different methods are similar and quite small and thus omitted for the purpose of better visualization and space-saving. We first simulate K = 50 instances with each θ_i ∼ Beta(0.5, 0.5), θ_i ∼ Beta(2, 2), θ_i ∼ Beta(2, 1) or θ_i ∼ Beta(4, 1). The density functions of these four different Beta distributions are plotted in Figure 6. For each generating distribution of θ_i, we compare Opt-KG using the uniform prior (Beta(1, 1)) (in red line) to Opt-KG with the true generating distribution as the prior (in blue line). The comparison in accuracy with different levels of budget (T = 2K, …, 20K) is shown in Figure 7. |
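The Opt-KG policy named in the Pseudocode row, and the simulated setup quoted in the Experiment Setup row, can be sketched roughly as follows. This is a minimal illustrative sketch under simplifying assumptions, not the authors' implementation: it assumes homogeneous, perfectly reliable workers (each label is a Bernoulli(θ_i) draw), and the function names (`opt_kg_pick`, `simulate`) are invented here. P(θ > 1/2) under a Beta posterior with integer pseudo-counts is computed via the standard Beta–Binomial identity so that only the Python standard library is needed.

```python
import math
import random

def p_theta_gt_half(a, b):
    # For integer pseudo-counts a, b: P(Beta(a, b) > 1/2) equals
    # P(Binomial(a + b - 1, 1/2) <= a - 1), a standard identity.
    n = a + b - 1
    return sum(math.comb(n, j) for j in range(a)) / 2 ** n

def h(a, b):
    # Value of the posterior Beta(a, b): accuracy of the Bayes decision
    # "declare positive iff P(theta > 1/2) >= 1/2".
    p = p_theta_gt_half(a, b)
    return max(p, 1.0 - p)

def opt_kg_pick(A, B):
    # Optimistic knowledge gradient: for each instance, compute the
    # one-step value gain from seeing a positive or a negative label,
    # take the larger (optimistic) of the two, and query the instance
    # with the largest optimistic gain.
    def reward(i):
        a, b = A[i], B[i]
        return max(h(a + 1, b) - h(a, b), h(a, b + 1) - h(a, b))
    return max(range(len(A)), key=reward)

def simulate(K=50, T=1000, prior=(1, 1), gen=(2.0, 2.0), seed=0):
    # One simulated run in the spirit of the quoted setup: draw true soft
    # labels theta_i from a Beta generating distribution, run Opt-KG under
    # a Beta prior, and report labeling accuracy at budget T.
    rng = random.Random(seed)
    theta = [rng.betavariate(*gen) for _ in range(K)]
    truth = [t > 0.5 for t in theta]
    A = [prior[0]] * K        # positive-label pseudo-counts
    B = [prior[1]] * K        # negative-label pseudo-counts
    for _ in range(T):
        i = opt_kg_pick(A, B)
        if rng.random() < theta[i]:   # simulated worker answers "positive"
            A[i] += 1
        else:
            B[i] += 1
    pred = [p_theta_gt_half(a, b) >= 0.5 for a, b in zip(A, B)]
    return sum(p == t for p, t in zip(pred, truth)) / K
```

Comparing `simulate(prior=(1, 1), gen=(2, 2))` against `simulate(prior=(2, 2), gen=(2, 2))` across budgets T = 2K, …, 20K mirrors the uniform-prior versus true-prior comparison the paper reports in Figure 7.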