Learning Semantic Annotations for Tabular Data

Authors: Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, Charles Sutton

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our technique using the DBpedia KB and three table sets: T2Dv2 from the general Web, Limaye and Efthymiou from the Wikipedia encyclopedia. As well as testing single table sets, the evaluation specially considers the generalization (transferability) of the prediction model from one table set to another. The evaluation suggests that our method is effective and that its overall accuracy is higher than the state-of-the-art in most cases.
Researcher Affiliation | Academia | Jiaoyan Chen (1), Ernesto Jiménez-Ruiz (2,4), Ian Horrocks (1,2) and Charles Sutton (2,3); (1) Department of Computer Science, University of Oxford, UK; (2) The Alan Turing Institute, London, UK; (3) School of Informatics, The University of Edinburgh, UK; (4) Department of Informatics, University of Oslo, Norway
Pseudocode | Yes | Algorithm 1: P2Vec Extract (L, L), P, N, α
Open Source Code | Yes | Codes: https://github.com/alan-turing-institute/SemAIDA
Open Datasets | Yes | In the evaluation conducted in this paper we rely on DBpedia and three web table sets: T2Dv2 from the general Web, Limaye [Limaye et al., 2010] and Efthymiou [Efthymiou et al., 2017] from the Wikipedia encyclopedia.
Dataset Splits | No | The paper states that 'T2Dv2 is randomly split into T2D-Tr (70%) and T2D-Te (30%)' for training and testing, but does not mention a separate validation split or its size.
Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments (e.g., CPU, GPU models, or memory).
Software Dependencies | No | The paper mentions 'Adam' and 'word2vec model' as well as 'DBpedia lookup service' and 'SPARQL endpoint', but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Regarding the micro table, the number of rows m is set to 5, the number of surrounding columns l is set to 4, and zero-padding is used for tables that do not have enough columns or rows. In training, negative samples are constructed by labeling the entity column with each wrong class; a word2vec model [Mikolov et al., 2013] trained on the latest dump of Wikipedia articles is adopted. The HNN is trained with Adam [Kingma and Ba, 2014] and a softmax cross-entropy loss. The hidden size and the attention layer size of the RNN are set to 150 and 50, the column Conv filter set Θ1 and the row Conv filter set Θ2 are set to {2, 3, 4} and {2, 3}, and the feature number per filter (κ1 and κ2) is set to 32. In computing P2Vec, the DBpedia lookup service and SPARQL endpoint are used, while the hyperparameters σ, N and α are set to 0.005, 5 and 0.85 respectively.
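
The settings quoted in the Experiment Setup row can be gathered into a single configuration for quick reference. The Python sketch below does this and illustrates padding a micro table to m = 5 rows and 1 + l = 5 columns. It is a minimal illustration under our own naming assumptions: CONFIG and pad_micro_table are hypothetical and are not taken from the SemAIDA repository, and the padding here uses empty cell strings to stand in for the paper's zero-padding.

# Minimal sketch (not from the SemAIDA code base): hyperparameters quoted in
# the Experiment Setup row, plus an illustrative padding helper for the
# m x (1 + l) micro table described there.
CONFIG = {
    "micro_table_rows_m": 5,           # rows per micro table
    "surrounding_cols_l": 4,           # surrounding columns per micro table
    "rnn_hidden_size": 150,            # HNN: RNN hidden size
    "attention_layer_size": 50,        # HNN: attention layer size
    "column_conv_filters": [2, 3, 4],  # Theta_1
    "row_conv_filters": [2, 3],        # Theta_2
    "features_per_filter": 32,         # kappa_1 and kappa_2
    "p2vec_sigma": 0.005,
    "p2vec_N": 5,
    "p2vec_alpha": 0.85,
}

def pad_micro_table(rows, m=5, l=4, pad_cell=""):
    """Pad (or truncate) a list of rows of cell strings, entity column first,
    to exactly m rows and 1 + l columns; empty strings stand in for the
    zero-padding applied to tables that are too small."""
    width = 1 + l
    padded = [list(r[:width]) + [pad_cell] * (width - len(r[:width]))
              for r in rows[:m]]
    padded += [[pad_cell] * width for _ in range(m - len(padded))]
    return padded

# Example: a 3 x 2 table is padded to 5 x 5.
table = [["Berlin", "3644826"], ["Munich", "1471508"], ["Hamburg", "1841179"]]
micro = pad_micro_table(table, CONFIG["micro_table_rows_m"], CONFIG["surrounding_cols_l"])
print(len(micro), len(micro[0]))  # 5 5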