Towards Knowledge-Driven Annotation

Authors: Yassine Mrabet, Claire Gardent, Muriel Foulonneau, Elena Simperl, Eric Ras

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the proposed method, we represent the reference knowledge bases as co-occurrence matrices and the disambiguation problem as a 0-1 Integer Linear Programming (ILP) problem. The proposed approach is unsupervised and can be ported to any RDF knowledge base. The system implementing this approach, called KODA, shows very promising results w.r.t. state-of-the-art annotation tools in cross-domain experimentations. We then present our approach and compare the results obtained on 5 different benchmarks with those obtained by state-of-the-art systems.
Researcher Affiliation | Academia | Yassine Mrabet, CRP Henri Tudor, Luxembourg (yassine.mrabet@tudor.lu); Claire Gardent, CNRS/LORIA, Nancy, France (claire.gardent@loria.fr); Muriel Foulonneau, CRP Henri Tudor, Luxembourg (muriel.foulonneau@tudor.lu); Elena Simperl, University of Southampton, Southampton, United Kingdom (e.simperl@soton.ac.uk); Eric Ras, CRP Henri Tudor, Luxembourg (eric.ras@tudor.lu). Note from the paper: "On the 1st of January 2015, CRP Henri Tudor and CRP Gabriel Lippmann will merge to form the Luxembourg Institute of Science & Technology (http://www.list.lu)"
Pseudocode | No | The paper describes the process and provides mathematical formulas (Equations 1-4) but does not include a distinct pseudocode or algorithm block.
Open Source Code | No | The paper only provides a link to an online demonstration ("Online demonstration: http://smartdocs.tudor.lu/koda"), not the source code itself.
Open Datasets | Yes | We used 3 standard benchmarks from the literature... AQUAINT benchmark (724 SFs) (Milne and Witten 2008)... MSNBC data (660 SFs) (Cucerzan 2007)... IITB corpus proposed in (Kulkarni et al. 2009). All corpora used in the experiments as well as the results obtained by KODA are available on the project website (footnote 13: http://smartdocs.tudor.lu/koda/datasets.html).
Dataset Splits | No | The paper describes the benchmarks and corpora used (e.g., "AQUAINT benchmark (724 SFs)", "MSNBC data (660 SFs)", "IITB corpus contains more than 19K manual annotations for 103 documents", "WikiNews (404 annotated surface forms)", "PubMed, consisting of 312 annotated SFs"), but it does not specify any training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit mention of predefined splits).
Hardware Specification | Yes | The construction of the DBpedia matrix lasted 5.4 hours with relevant partitioning and optimizations of a DBMS (experimental configuration: MySQL InnoDB with 8 GB RAM and a 4-core laptop).
Software Dependencies | Yes | We used the 0-1 ILP solver LpSolve as it provided both acceptable performance and an accessible implementation. (footnote 10: http://lpsolve.sourceforge.net/5.5/)
Experiment Setup | Yes | KODA's results were obtained by setting the number of first items (resources) returned by SolR queries to 40 for all SFs and the HA threshold to 9/20.
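To illustrate the 0-1 ILP formulation mentioned in the abstract, here is a minimal sketch of entity disambiguation cast as a binary integer program. This is not KODA's actual objective (the paper derives scores from knowledge-base co-occurrence matrices, and the experiments use LpSolve); the surface forms, candidate resources, and relatedness scores below are invented for illustration, and the solver is SciPy's generic MILP interface.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Two surface forms, each with two candidate KB resources (hypothetical).
candidates = {
    "Paris": ["dbr:Paris", "dbr:Paris_Hilton"],
    "Seine": ["dbr:Seine", "dbr:Seine-Maritime"],
}
# Hypothetical relatedness scores, one per (surface form, candidate) pair,
# flattened in the same order as the candidate lists above.
scores = np.array([0.9, 0.2, 0.8, 0.3])

# One binary decision variable per candidate; milp minimizes, so negate.
c = -scores
integrality = np.ones_like(c)  # all variables restricted to integers
bounds = Bounds(0, 1)          # combined with integrality: 0-1 variables

# Exactly one candidate must be selected for each surface form.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
constraints = LinearConstraint(A, lb=1, ub=1)

res = milp(c=c, constraints=constraints,
           integrality=integrality, bounds=bounds)

# Read the selected candidate for each surface form off the 0-1 solution.
all_candidates = [cand for cands in candidates.values() for cand in cands]
chosen = [cand for cand, x in zip(all_candidates, res.x) if x > 0.5]
print(chosen)  # -> ['dbr:Paris', 'dbr:Seine']
```

With a purely linear per-candidate objective the optimum is simply the best-scoring candidate per surface form; the interest of the ILP encoding is that pairwise coherence terms between candidates (as derived from co-occurrence statistics) can be linearized and added to the same program.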