Accounting for Confirmation Bias in Crowdsourced Label Aggregation

Authors: Meric Altug Gemalmaz, Ming Yin

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on real-world crowd annotations show that the proposed bias-aware label aggregation algorithm outperforms baseline methods in accurately inferring the ground-truth labels of different tasks when crowd workers indeed exhibit some degree of confirmation bias. Through simulations on synthetic data, we further identify the conditions when the proposed algorithm has the largest advantages over baseline methods.
Researcher Affiliation | Academia | Meric Altug Gemalmaz, Ming Yin, Purdue University, {mgemalma, mingyin}@purdue.edu
Pseudocode | No | The paper includes Figure 1, which is a probabilistic graphical model, but no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states, 'We implemented these baseline algorithms using the open-sourced code repository provided by Zheng et al. [2017].' However, it provides no statement or link indicating that the code for its own proposed method is open-sourced.
Open Datasets | No | The paper states 'we first collected a set of annotations generated by real crowd workers on the subjective task of differentiating factual statements from opinion statements' and 'we generated synthetic datasets of worker annotations'. However, it does not provide concrete access information (e.g., a link, DOI, or explicit statement of public availability) for either the collected or the synthetic datasets.
Dataset Splits | Yes | In addition to making inference using the entire set of annotations from all 110 workers, to see how the accuracy of the inference varies with the number of annotators, we also randomly sampled annotations from K (K ∈ {20, 50, 80}) workers and inferred the ground-truth label for each statement using only the subset of annotations provided by these K workers. For each K, we repeated the random sampling process 100 times, and the average accuracy of the inferred labels across the 100 trials is presented in Figure 2 for each algorithm. (A sketch of this subsampling protocol appears after the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments.
Software Dependencies | No | The paper mentions the 'open-sourced code repository provided by Zheng et al. [2017]' for baselines, but it does not specify any software dependencies with version numbers for its own implementation or other tools used.
Experiment Setup | Yes | To account for the impact of parameter initialization on the performance of the algorithm, we deployed an empirically effective heuristic to restart the EM algorithm. We ran EM three times. For all three runs, we adopted a relatively uninformative initialization for p_i, π, and a (p_i = 0.5, π = 0.5, and a = 2). In the first EM run, we initialized c_i = 0.5 and initialized all statement values at one extreme (e.g., s_j = 1), hoping that this run would return an accurate ordering of c_i. In the second EM run, we initialized s_j = 0.5 and c_i = 1, hoping to get an accurate ordering of s_j. In the third EM run, we initialized c_i (s_j) using the final c_i (s_j) values from the first (second) run. In the end, we report the inference results from the EM run that gives the highest likelihood of the data. We then simulated a group of N = 25 workers by setting a = 2, sampling each worker's c_i uniformly at random between 0 and 1 (i.e., c_i ∼ U[0, 1]), and setting p_i ∼ Beta(1, β). For each value of β, we generated 50 synthetic datasets by simulating each worker's annotation on each task according to Eqn. 1. (A sketch of this restart heuristic and the synthetic-data generation follows the table.)
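The worker-subsampling evaluation quoted in the Dataset Splits row can be summarized in a short Python sketch. This is a minimal illustration, not the authors' code: `annotations`, `true_labels`, and `aggregate_labels` are hypothetical stand-ins for the collected crowd annotations (worker_id → {task_id: label}), the ground-truth labels, and whichever aggregation algorithm (the bias-aware EM or a baseline) is being evaluated.

```python
import random

import numpy as np


def subsample_accuracy(annotations, true_labels, aggregate_labels,
                       k, n_trials=100, seed=0):
    """Average accuracy of inferred labels over repeated random subsets
    of k workers, mirroring the K in {20, 50, 80} protocol above."""
    rng = random.Random(seed)
    workers = sorted(annotations)  # annotations: worker_id -> {task_id: label}
    accuracies = []
    for _ in range(n_trials):
        subset = rng.sample(workers, k)
        sub_annotations = {w: annotations[w] for w in subset}
        inferred = aggregate_labels(sub_annotations)  # hypothetical: task_id -> inferred label
        correct = [inferred[t] == true_labels[t] for t in true_labels]
        accuracies.append(np.mean(correct))
    return float(np.mean(accuracies))


# Example usage (names are placeholders):
# for k in (20, 50, 80):
#     print(k, subsample_accuracy(annotations, true_labels, bias_aware_em, k))
```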
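The Experiment Setup row describes two pieces that translate naturally into code: the three-run EM restart heuristic and the synthetic worker generation. The sketch below is an assumption-laden illustration rather than the authors' implementation: `run_em` and `annotate` are hypothetical callables standing in for the paper's EM updates and its annotation model (Eqn. 1, not reproduced here), and `n_tasks` is an arbitrary placeholder.

```python
import numpy as np


def em_with_restarts(run_em, n_workers, n_tasks):
    """Three-run restart heuristic: keep the EM run with the highest data likelihood.
    `run_em(init)` is assumed to return (params, log_likelihood)."""
    base = dict(p=0.5, pi=0.5, a=2.0)  # uninformative initialization of p_i, pi, a
    # Run 1: c_i = 0.5, all s_j at one extreme, aiming for a good ordering of c_i.
    params1, ll1 = run_em(dict(base, c=np.full(n_workers, 0.5), s=np.ones(n_tasks)))
    # Run 2: s_j = 0.5, c_i = 1, aiming for a good ordering of s_j.
    params2, ll2 = run_em(dict(base, c=np.ones(n_workers), s=np.full(n_tasks, 0.5)))
    # Run 3: reuse the c_i estimates from run 1 and the s_j estimates from run 2.
    params3, ll3 = run_em(dict(base, c=params1["c"], s=params2["s"]))
    return max([(ll1, params1), (ll2, params2), (ll3, params3)],
               key=lambda run: run[0])[1]


def simulate_workers(beta, annotate, n_workers=25, n_tasks=100, a=2.0, rng=None):
    """One synthetic dataset: c_i ~ U[0, 1], p_i ~ Beta(1, beta), a fixed at 2.
    `annotate` stands in for the paper's annotation model (Eqn. 1)."""
    rng = rng or np.random.default_rng()
    c = rng.uniform(0.0, 1.0, size=n_workers)
    p = rng.beta(1.0, beta, size=n_workers)
    labels = np.array([[annotate(p[i], c[i], a, j, rng) for j in range(n_tasks)]
                       for i in range(n_workers)])
    return c, p, labels


# As in the simulation study: 50 synthetic datasets for each value of beta.
# datasets = [simulate_workers(beta, eqn1_annotator) for _ in range(50)]
```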