Tree Edit Distance Learning via Adaptive Symbol Embeddings
Authors: Benjamin Paaßen, Claudio Gallicchio, Alessio Micheli, Barbara Hammer
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that BEDL improves upon the state-of-the-art in metric learning for trees on six benchmark data sets, ranging from computer science over biomedical data to a natural-language processing data set containing over 300,000 nodes. |
| Researcher Affiliation | Academia | 1Cognitive Interaction Technology, Bielefeld University, Germany 2Department of Computer Science, University of Pisa, Italy. |
| Pseudocode | No | The paper mentions 'Computing this average over all cheapest edit scripts is possible efficiently via a novel forward-backward algorithm which we developed for this contribution (refer to the supplementary material; Paaßen (2018a)).' While an algorithm is mentioned, its pseudocode is referred to supplementary material and not present in the main paper. |
| Open Source Code | Yes | As implementations, we used custom implementations of KNN, MGLVQ, the goodness classifier, GESL, and BEDL, which are available at https://doi.org/10.4119/unibi/2919994. |
| Open Datasets | Yes | Cystic and Leukemia: Two data sets from the KEGG/Glycan data base (Hashimoto et al., 2006) adapted from Gallicchio & Micheli (2013)... Sentiment: initialized the vectorial embedding with the 300-dimensional Common Crawl GloVe embedding (Pennington et al., 2014). |
| Dataset Splits | Yes | On each data set, we perform a crossvalidation... We used 20 folds for Strings and Sentiment, 10 for Cystic and Leukemia, 8 for Sorting and 6 for MiniPalindrome. For the programming data sets, the number of folds had to be reduced to ensure that each fold still contained a meaningful number of data points. For the Cystic and Leukemia data set, our ten folds were consistent with the paper of Gallicchio & Micheli (2013). In all cases, folds were generated such that the label distribution of the overall data set was maintained. |
| Hardware Specification | Yes | All experiments were performed on a consumer-grade laptop with an Intel Core i7-7700 HQ CPU. |
| Software Dependencies | No | For SVM, we utilized the LIBSVM standard implementation (Chang & Lin, 2011). While LIBSVM is cited, the paper does not provide a complete list of software dependencies or version numbers. |
| Experiment Setup | Yes | We optimized all hyper-parameters in a nested 5-fold crossvalidation, namely the number of prototypes K for MGLVQ and LVQ metric learning in the range [1, 15], the number of neighbors for KNN in the range [1, 15], the kernel bandwidth for SVM in the range [0.1, 10], the sparsity parameter λ for the goodness classifier in the range [10⁻⁵, 10], and the regularization strength β for GESL and BEDL in the range 2·K·m·[10⁻⁶, 10⁻²]. |
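The evaluation protocol described above (stratified outer folds with an inner 5-fold cross-validation for hyper-parameter selection) can be sketched as follows. This is a minimal illustration, not the authors' code: it uses scikit-learn and a plain KNN classifier as a stand-in, and the function name `run_nested_cv` is hypothetical.

```python
# Hedged sketch of the paper's protocol: stratified outer cross-validation
# (preserving the label distribution, as stated in the Dataset Splits row)
# with an inner 5-fold grid search over a hyper-parameter range such as
# the number of neighbors for KNN in [1, 15].
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def run_nested_cv(X, y, n_outer_folds=10):
    """Return the mean outer-fold test accuracy of a nested CV run."""
    outer = StratifiedKFold(n_splits=n_outer_folds, shuffle=True,
                            random_state=0)
    param_grid = {"n_neighbors": list(range(1, 16))}  # the paper's [1, 15]
    accuracies = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner 5-fold CV selects the hyper-parameter on training data only.
        search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
        search.fit(X[train_idx], y[train_idx])
        # Score the selected model on the held-out outer fold.
        accuracies.append(search.score(X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))
```

Because the hyper-parameter is chosen inside each outer training fold, the outer test accuracy is an unbiased estimate of generalization performance, matching the nested setup the paper describes.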