A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution

Authors: Arya Mazumdar, Barna Saha

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we make the first attempt to close this gap. We provide a thorough analysis of the prominent heuristic algorithms for crowd-based ER. We justify experimental observations with our analysis and information theoretic lower bounds. Moreover, we conduct a thorough experiment on the bibliographical cora (McCallum 2004) dataset for ER and several synthetic datasets to validate the theoretical findings further.
Researcher Affiliation | Academia | Arya Mazumdar and Barna Saha, College of Information & Computer Sciences, University of Massachusetts Amherst, {arya,barna}@cs.umass.edu
Pseudocode | No | The paper describes algorithms in prose but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We used the widely used cora (McCallum 2004) dataset for ER.
Dataset Splits | No | The paper mentions using "cora" and "synthetic datasets" but does not provide specific details on training, validation, or test data splits.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | We created multiple synthetic datasets each containing 1200 nodes and 14 clusters with the following size distribution: two clusters of size 200, four clusters of size 100, eight clusters of size 50, two clusters each of size 30 and 20 and the rest of the clusters of size 10. The datasets differed in the way similarity values are generated by varying ϵ and sampling the values either from Dist-1 or Dist-2. The similarity values are further discretized to take values from the set {0, 0.1, 0.2, ..., 0.9, 1}. We used the similarity function as in (Whang, Lofgren, and Garcia-Molina 2013; Wang et al. 2013; Vesdapunt, Bellare, and Dalvi 2014; Firmani, Saha, and Srivastava 2016).
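The discretization step in the experiment setup can be illustrated with a minimal Python sketch. The function name `discretize` is hypothetical, and rounding to the nearest grid point is an assumption: the quoted setup says only that similarity values take values from {0, 0.1, ..., 1}, not how a raw value is mapped onto that set. The sampling distributions Dist-1 and Dist-2 are not defined in this summary, so they are not modeled here.

```python
def discretize(similarity: float) -> float:
    """Map a raw similarity score onto the grid {0, 0.1, ..., 0.9, 1}.

    Assumption: values are clipped to [0, 1] and rounded to the nearest
    grid point; the source does not specify the rounding rule.
    """
    clipped = min(1.0, max(0.0, similarity))
    # Scale to tenths, round to an integer step, and scale back.
    return round(clipped * 10) / 10


if __name__ == "__main__":
    # Hypothetical raw scores, used only to exercise the function.
    raw_scores = [0.34, 0.96, -0.2, 1.7, 0.5]
    print([discretize(s) for s in raw_scores])
```

Dividing an integer step by 10 at the end, rather than rounding the float to one decimal place, keeps every output equal to one of the eleven grid values.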