Efficient Approximation Algorithms for Strings Kernel Based Sequence Classification

Authors: Muhammad Farhan, Juvaria Tariq, Arif Zaman, Mudassir Shabbir, Imdad Ullah Khan

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our algorithm achieves excellent approximation performance with theoretical guarantees. In the process we solve an open combinatorial problem, which was posed as a major hindrance to the scalability of existing solutions. We give analytical bounds on quality and runtime of our algorithm and report its empirical performance on real world biological and music sequences datasets.
Researcher Affiliation Academia Muhammad Farhan Department of Computer Science School of Science and Engineering Lahore University of Management Sciences Lahore, Pakistan 14030031@lums.edu.pk Juvaria Tariq Department of Mathematics School of Science and Engineering Lahore University of Management Sciences Lahore, Pakistan jtariq@emory.edu Arif Zaman Department of Computer Science School of Science and Engineering Lahore University of Management Sciences Lahore, Pakistan arifz@lums.edu.pk Mudassir Shabbir Department of Computer Science Information Technology University Lahore, Pakistan mudassir.shabbir@itu.edu.pk Imdad Ullah Khan Department of Computer Science School of Science and Engineering Lahore University of Management Sciences Lahore, Pakistan imdad.khan@lums.edu.pk
Pseudocode Yes Algorithm 1 : Approximate-Kernel(SX,SY ,k,m,ϵ,δ,B)
Open Source Code Yes Sources for datasets and source code are available at 1. https://github.com/mufarhan/sequence_class_NIPS_2017
Open Datasets Yes Table 1: Datasets description Name Task Classes Seq. Av.Len. Evaluation Ding-Dubchak [6] protein fold recognition 27 694 169 10-fold CV SCOP [4, 31] protein homology detection 54 7329 308 54 binary class. Music [21, 26] music genre recognition 10 1000 2368 5-fold CV Artist20 [8, 17] artist identification 20 1413 9854 6-fold CV ISMIR [17] music genre recognition 6 729 10137 5-fold CV
Dataset Splits Yes Table 1: Datasets description Name Task Classes Seq. Av.Len. Evaluation Ding-Dubchak [6] protein fold recognition 27 694 169 10-fold CV SCOP [4, 31] protein homology detection 54 7329 308 54 binary class. Music [21, 26] music genre recognition 10 1000 2368 5-fold CV Artist20 [8, 17] artist identification 20 1413 9854 6-fold CV ISMIR [17] music genre recognition 6 729 10137 5-fold CV
Hardware Specification Yes These experiments are performed on an Intel Xeon machine with (8 Cores, 2.1 GHz and 32 GB RAM) using the same experimental settings as in [13, 15, 17]. Since our algorithm is applicable for significantly wider range of k and m, we also report classification performance with large k and m. For our algorithm we used B {300, 500} and σ {0.25, 0.5} with no significant difference in results as implied by the theoretical analysis. In all reported results B = 300 and σ = 0.5. In order to perform comparisons, for a few combinations of parameters we generated exact kernel matrices of each dataset on a much more powerful machine (a cluster of 20 nodes, each having 24 CPU s with 2.5 GHz speed and 128GB RAM).
Software Dependencies No The paper mentions using 'the publicly available SVM implementation LIBSVM [2]' but does not provide a specific version number for it or any other software dependencies.
Experiment Setup Yes For our algorithm we used B {300, 500} and σ {0.25, 0.5} with no significant difference in results as implied by the theoretical analysis. In all reported results B = 300 and σ = 0.5. [...] The parameters used for reporting classification performance are chosen in order to maintain comparability with previous studies. Similarly all measurements are made as in [13, 14], for instance for music genre classification we report results of 10-fold cross-validation (see Table 1). For our algorithm we used B = 300 and σ = 0.5 and we take an average of performances over three independent runs.