A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution
Authors: Arya Mazumdar, Barna Saha
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we make the first attempt to close this gap. We provide a thorough analysis of the prominent heuristic algorithms for crowd-based ER. We justify experimental observations with our analysis and information theoretic lower bounds. Moreover, we conduct a thorough experiment on the bibliographical cora (McCallum 2004) dataset for ER and several synthetic datasets to validate the theoretical findings further. |
| Researcher Affiliation | Academia | Arya Mazumdar and Barna Saha, College of Information & Computer Sciences, University of Massachusetts Amherst, {arya,barna}@cs.umass.edu |
| Pseudocode | No | The paper describes algorithms in prose but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We used the widely used cora (McCallum 2004) dataset for ER. |
| Dataset Splits | No | The paper mentions using "cora" and "synthetic datasets" but does not provide specific details on training, validation, or test data splits. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | We created multiple synthetic datasets each containing 1200 nodes and 14 clusters with the following size distribution: two clusters of size 200, four clusters of size 100, eight clusters of size 50, two clusters each of size 30 and 20 and the rest of the clusters of size 10. The datasets differed in the way similarity values are generated by varying ϵ and sampling the values either from Dist-1 or Dist-2. The similarity values are further discretized to take values from the set {0, 0.1, 0.2, ..., 0.9, 1}. We used the similarity function as in (Whang, Lofgren, and Garcia-Molina 2013; Wang et al. 2013; Vesdapunt, Bellare, and Dalvi 2014; Firmani, Saha, and Srivastava 2016). |
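
The discretization step mentioned in the Experiment Setup row can be illustrated with a minimal sketch. The quoted text only says that similarity values take values from the set {0, 0.1, 0.2, ..., 0.9, 1}; it does not state the rounding rule, so rounding to the nearest level is an assumption here, and the names `LEVELS` and `discretize` are hypothetical. Dist-1 and Dist-2 are not defined in this summary, so the sketch takes already-sampled values as input rather than guessing at them.

```python
from typing import Iterable, List

# The 11 discrete similarity levels named in the setup: 0, 0.1, ..., 0.9, 1.
LEVELS = [i / 10 for i in range(11)]


def discretize(similarities: Iterable[float]) -> List[float]:
    """Map each similarity in [0, 1] to a multiple of 0.1.

    Assumption: nearest-level rounding; the summary does not state the rule.
    """
    out = []
    for s in similarities:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"similarity {s} is outside [0, 1]")
        # Round in integer tenths, then divide, so results match LEVELS exactly.
        out.append(round(s * 10) / 10)
    return out


# Example input values are made up for illustration only.
print(discretize([0.04, 0.26, 0.5, 0.97]))  # → [0.0, 0.3, 0.5, 1.0]
```

Rounding in integer tenths keeps every output equal to one of the eleven levels, which avoids float artifacts when the discretized values are later compared or grouped.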