Sub-Merge: Diving Down to the Attribute-Value Level in Statistical Schema Matching

Authors: Zhe Lim, Benjamin Rubinstein

AAAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the superior statistical and computational performance of multiple sparse CCA compared to a suite of baseline algorithms, on two datasets which we are releasing to stimulate further research.
Researcher Affiliation | Collaboration | Zhe Lim and Benjamin I. P. Rubinstein, Department of Computing and Information Systems, The University of Melbourne, Australia; zhe@infinitelooplabs.com, brubinstein@unimelb.edu.au
Pseudocode | Yes | Algorithm 1: One Vs All CCA post-processing
Open Source Code | No | The paper states: “To foster further research on this problem, we are releasing with this paper two new manually-labeled datasets¹...”. Footnote 1 provides a URL for the datasets, but not for the source code of the methodology described in the paper.
Open Datasets | Yes | To foster further research on this problem, we are releasing with this paper two new manually-labeled datasets¹, constructed by multiple web crawls and crowd-sourced annotation. ¹Datasets at http://people.eng.unimelb.edu.au/brubinstein/data
Dataset Splits | No | The paper mentions “crossvalidation to model select regularization terms” but does not specify the explicit training, validation, or test dataset splits (e.g., percentages or sample counts) used for its experiments.
Hardware Specification | Yes | We measure runtime on a PC with a 2.3GHz Intel Core i7 processor & 8GB of memory.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the implementation of the described methods.
Experiment Setup | Yes | We perform binary search to set L1 penalties for a desired level of sparsity, and crossvalidation to model select regularization terms (Witten, Tibshirani, and Hastie 2009). An important task is to determine the number of principal components. A natural approach is via the scree plot: eigenvalues by rank. The retained components can be set by thresholding the eigenvalues which correspond to correlations under discovered components or by identifying a knee in the curve.
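
The Experiment Setup entry names two tuning steps: a binary search over the L1 penalty to reach a desired sparsity level, and thresholding of the scree-plot values to choose how many components to retain. The sketch below illustrates both in a simplified form and is not the authors' implementation. It assumes the penalty acts as a soft-threshold on a single fixed weight vector and that the per-component correlations are already available. The function names (`soft_threshold`, `penalty_for_sparsity`, `components_above_threshold`) and the example values are hypothetical.

```python
import numpy as np


def soft_threshold(v, lam):
    """Element-wise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def penalty_for_sparsity(v, target_nonzeros, tol=1e-9, max_iter=100):
    """Binary-search a penalty so that `target_nonzeros` entries of v survive.

    Assumption: sparsity is measured on the soft-thresholded copy of a fixed
    vector v. A full sparse CCA solver would re-run this inside each update.
    """
    lo, hi = 0.0, float(np.max(np.abs(v)))
    lam = hi
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        nonzeros = int(np.count_nonzero(soft_threshold(v, lam)))
        if nonzeros == target_nonzeros:
            break
        if nonzeros > target_nonzeros:
            lo = lam  # too dense: a larger penalty is needed
        else:
            hi = lam  # too sparse: a smaller penalty is needed
        if hi - lo < tol:
            break  # ties in |v| can make the exact target unreachable
    return lam


def components_above_threshold(correlations, threshold):
    """Count components whose correlation strictly exceeds `threshold`."""
    ordered = np.sort(np.asarray(correlations, dtype=float))[::-1]
    return int(np.sum(ordered > threshold))


if __name__ == "__main__":
    weights = np.array([0.9, -0.5, 0.3, 0.1, -0.05])  # made-up example values
    lam = penalty_for_sparsity(weights, target_nonzeros=2)
    print("penalty:", lam)
    print("sparse weights:", soft_threshold(weights, lam))
    print("retained:", components_above_threshold([0.95, 0.8, 0.3, 0.1], 0.5))
```

The threshold passed to `components_above_threshold` is a free choice here. The alternative the entry mentions, identifying a knee in the scree curve, would replace the fixed threshold with a test on the drop between consecutive values.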