Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing

Authors: Nihar B. Shah, Dengyong Zhou

JMLR 2016

Reproducibility assessment (variable: result, with the supporting LLM response):
Research Type: Experimental. Evidence: "In this section, we present synthetic simulations and real-world experiments to evaluate the effects of our setting and our mechanism on the final label quality. We conducted preliminary experiments on the Amazon Mechanical Turk commercial crowdsourcing platform (mturk.com) to evaluate our proposed scheme in real-world scenarios."
Researcher Affiliation: Collaboration. Evidence: Nihar B. Shah (EMAIL), Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA; Dengyong Zhou (EMAIL), Machine Learning Department, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA.
Pseudocode: Yes. Evidence: Algorithm 1 ("Incentive mechanism for skip-based setting") and Algorithm 2 ("Incentive mechanism for the confidence-based setting").
Open Source Code: No. The paper does not provide concrete access to source code for the methodology described. It states: "The complete data, including the interface presented to the workers in each of the tasks, the results obtained from the workers, and the ground truth solutions, are available on the website of the first author." This refers to data, not code.
Open Datasets: Yes. Evidence: "The complete data, including the interface presented to the workers in each of the tasks, the results obtained from the workers, and the ground truth solutions, are available on the website of the first author." One task required workers to identify the breeds of dogs shown in 85 images (source of images: Khosla et al., 2011; Deng et al., 2009); another required them to identify the textures shown in 24 grayscale images (source of images: Lazebnik et al., 2005, Data set 1: Textured surfaces).
Dataset Splits: No. The paper describes how gold standard questions are distributed randomly among the N questions, but does not provide specific training/validation/test splits for model training or evaluation. For example: "The G gold standard questions are assumed to be distributed uniformly at random in the pool of N questions (of course, the worker does not know which G of the N questions form the gold standard)."
Hardware Specification: No. The paper mentions experiments conducted on the "Amazon Mechanical Turk commercial crowdsourcing platform (mturk.com)" but does not specify any particular hardware (e.g., GPU/CPU models, memory) used for running experiments or simulations.
Software Dependencies: No. The paper does not list specific software dependencies or versions used for implementation or experimentation (e.g., Python, PyTorch, or other libraries with version numbers).
Experiment Setup: Yes. Evidence: "In this set of simulations, we set T = 0.75. We compared (a) the baseline mechanism with 5 cents for each correct answer in the gold standard, (b) the skip-based mechanism with κ = 5.9 and 1/T = 1.5, and (c) the confidence-based mechanism with κ = 5.9 cents, L = 2, α2 = 1.5, α1 = 1.4, α0 = 1, α-1 = 0.5, α-2 = 0."
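The multiplicative ("double or nothing") payment rules that these parameters configure can be sketched as follows. This is an illustrative reading of the mechanisms, not the authors' implementation: it assumes the skip-based mechanism multiplies a base payment κ by 1/T per correct gold-standard answer and pays nothing if any attempted gold answer is wrong, and that the confidence-based mechanism applies the quoted multiplier α_l for a correct answer at confidence level l and α_{-l} for a wrong one (level 0 = skip). Function names and the data-structure choices are hypothetical.

```python
def skip_based_payment(gold_results, kappa=5.9, inv_T=1.5):
    """Skip-based mechanism (cf. Algorithm 1): each correct gold answer
    multiplies the base payment kappa by 1/T; any wrong (non-skipped)
    gold answer zeroes the payment.

    gold_results: list with entries 'correct', 'wrong', or 'skip'.
    """
    if any(r == "wrong" for r in gold_results):
        return 0.0
    n_correct = sum(r == "correct" for r in gold_results)
    return kappa * inv_T ** n_correct


def confidence_based_payment(
    gold_results,
    kappa=5.9,
    alpha={2: 1.5, 1: 1.4, 0: 1.0, -1: 0.5, -2: 0.0},
):
    """Confidence-based mechanism (cf. Algorithm 2, L = 2): an answer at
    confidence level l multiplies the payment by alpha[l] if correct and
    alpha[-l] if wrong; level 0 means 'skip' (multiplier 1). Note that
    alpha[-2] = 0, so a wrong answer at the highest confidence pays nothing.

    gold_results: list of (confidence_level, is_correct) pairs;
    use (0, None) for a skipped question.
    """
    payment = kappa
    for level, is_correct in gold_results:
        if level == 0:
            continue
        payment *= alpha[level] if is_correct else alpha[-level]
    return payment
```

Under these assumed rules, a worker who answers three gold questions correctly and skips the rest would earn 5.9 x 1.5^3, about 19.9 cents, while a single wrong gold answer in the skip-based setting drops the payment to zero.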