Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generalization Properties of hyper-RKHS and its Applications

Authors: Fanghui Liu, Lei Shi, Xiaolin Huang, Jie Yang, Johan A.K. Suykens

JMLR 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, results on several benchmarks suggest that the employed framework is able to learn a general kernel function from an arbitrary similarity matrix, and thus achieves satisfactory performance on classification tasks. Keywords: hyper-RKHS, approximation theory, kernel learning, out-of-sample extensions. In Section 5, we present numerical results on several benchmark datasets to verify the effectiveness of our two-stage kernel learning framework.
Researcher Affiliation | Academia | Fanghui Liu (Department of Electrical Engineering, ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, Leuven, B-3001, Belgium); Lei Shi (Shanghai Key Laboratory for Contemporary Applied Mathematics, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China); Xiaolin Huang and Jie Yang (Institute of Image Processing and Pattern Recognition and Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, 200240, China); Johan A.K. Suykens (Department of Electrical Engineering, ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, Leuven, B-3001, Belgium)
Pseudocode | Yes | Algorithm 1: Divide-and-conquer with Nyström approximation for KRR in hyper-RKHS
Open Source Code | Yes | The source code of our implementation can be found in http://www.lfhsgre.org.
Open Datasets | Yes | Here we carry out experiments to investigate the approximation performance for known kernels. The out-of-sample-extension based algorithm (Pan et al., 2017) is included for comparison. This method solves a nonnegative least squares problem in hyper-RKHS, which can be regarded as a special case of hyper-KRR. Nevertheless, we do not claim that the learned (indefinite) kernel in our framework is better than the PD one from (Pan et al., 2017); instead, our target is to show the utility and flexibility of our framework. For a fair comparison, the three algorithms in hyper-RKHS are associated with the same hyper-kernel, i.e., the hyper-Gaussian kernel used in this subsection. For the experiments on UCI data sets, the data points are partitioned into 40% labeled data, 40% unlabeled data, and 20% test data; the labeled and unlabeled data points form the training dataset. This setting follows (Pan et al., 2017), which simultaneously considers transductive and inductive learning. Here the pre-given kernel matrix is generated by a known kernel, including a positive definite one and an indefinite one. Learning on known kernels focuses on the approximation performance of the compared algorithms on these kernels. The evaluation metric used here is the relative mean square error (RMSE) between the learned regression function k(x, x′) and the pre-given kernel matrix K over m² pairwise data points. Besides, we also evaluate our kernel learning methods incorporated into SVM for classification. As a consequence, this experimental setting on known kernels helps us to comprehensively investigate the approximation ability of the compared algorithms on PD and non-PD kernels. Both data sets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Last, for out-of-sample extensions, we apply our method to non-parametric kernel learning on the MNIST handwritten digits dataset (Lecun et al.).
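The relative MSE criterion quoted above can be made concrete with a short sketch. The exact normalization is an assumption here (squared Frobenius distance divided by the squared Frobenius norm of the pre-given matrix), since the excerpt does not spell out the formula; the Gaussian toy matrices are purely illustrative.

```python
import numpy as np

def relative_mse(K_learned, K_given):
    """Relative mean square error between a learned kernel matrix and a
    pre-given one, computed over all m^2 pairwise entries.

    Assumed definition: ||K_learned - K_given||_F^2 / ||K_given||_F^2.
    """
    diff = np.linalg.norm(K_learned - K_given, ord="fro") ** 2
    return diff / np.linalg.norm(K_given, ord="fro") ** 2

# Toy check: a Gaussian kernel matrix versus a slightly perturbed copy.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
K = np.exp(-sq / (2 * X.var() * X.shape[1]))          # Gaussian kernel matrix
K_hat = K + 0.01 * rng.standard_normal(K.shape)       # "learned" approximation
err = relative_mse(K_hat, K)                          # small positive value
```

A perfect reconstruction gives an error of exactly zero, which makes the metric easy to sanity-check.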
Dataset Splits | Yes | For the experiments on UCI data sets, the data points are partitioned into 40% labeled data, 40% unlabeled data, and 20% test data; the labeled and unlabeled data points form the training dataset. This setting follows (Pan et al., 2017), which simultaneously considers transductive and inductive learning. We split the training data {x_i}_{i=1}^m into v disjoint subsets {V1, V2, ..., Vv}, and for simplicity assume that each partition has the same sample size, i.e., |V1| = |V2| = ... = |Vv| = n, so that m = nv. The number of subsets is set to v = 5, 10, 20 on the ijcnn1 dataset, and v = 50, 100, 200 on the covtype dataset. Following (Pan et al., 2017), the number of Nyström landmarks is set to M = 0.05m. Besides, we also include BMKL equipped with Gaussian kernels and polynomial kernels for comparison. Note that Nyström approximation on BMKL appears non-trivial, so we incorporate BMKL into the divide-and-conquer framework only. These kernel learning based algorithms are conducted by randomly picking 40% of the data for training and the rest for testing.
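The divide-and-conquer splitting described above (v disjoint, equal-sized subsets with m = nv, and local fits combined at the end) can be sketched as follows. This is a plain-RKHS toy with a Gaussian kernel, not the paper's hyper-RKHS formulation over data pairs; the equal-size splitting and the averaging of the v local predictors are the only ingredients taken from the excerpt.

```python
import numpy as np

def divide_and_conquer_krr(X, y, v, sigma2, lam):
    """Split m training points into v disjoint subsets of (roughly) equal
    size n, fit Gaussian-kernel KRR independently on each subset, and
    average the v local predictors. A minimal sketch of the
    divide-and-conquer idea only."""
    m = len(X)
    idx = np.random.default_rng(0).permutation(m)
    parts = np.array_split(idx, v)          # v disjoint subsets, m = n*v

    models = []
    for p in parts:
        Xp, yp = X[p], y[p]
        sq = ((Xp[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq / (2 * sigma2))      # local kernel matrix
        alpha = np.linalg.solve(K + lam * len(p) * np.eye(len(p)), yp)
        models.append((Xp, alpha))

    def predict(Xt):
        preds = []
        for Xp, alpha in models:
            sq = ((Xt[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
            preds.append(np.exp(-sq / (2 * sigma2)) @ alpha)
        return np.mean(preds, axis=0)       # average the v local predictors

    return predict

# Toy usage: v = 5 local KRR models on a smooth 1-D target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 1))
y = np.sin(3 * X[:, 0])
f = divide_and_conquer_krr(X, y, v=5, sigma2=0.1, lam=1e-4)
pred = f(X)
```

Each local solve costs O(n³) instead of O(m³), which is the point of the scheme at the sample sizes of ijcnn1 and covtype.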
Hardware Specification | Yes | The experiments, implemented in MATLAB, are conducted on a PC with an Intel i7-8700K CPU (3.70 GHz) and 64 GB RAM.
Software Dependencies | No | The experiments, implemented in MATLAB, are conducted on a PC with an Intel i7-8700K CPU (3.70 GHz) and 64 GB RAM. To detail our scalable scheme, we begin with KRR in hyper-RKHS with Nyström approximation, and then present the divide-and-conquer strategy. To scale KRR in hyper-RKHS to large-sample settings, the Nyström scheme randomly selects a subset of M (often M ≪ m) training data {x̃1, x̃2, ..., x̃M} ⊆ {x1, x2, ..., xm}, termed landmarks or centers, to approximate the original hyper-kernel matrix. The sampling strategy can be uniform or a more advanced one, e.g., leverage-score based sampling (Alaoui and Mahoney, 2015). The solution of KRR-Nyström in hyper-RKHS via the pairs {(x̃i, x̃j)}_{i,j=1}^M is given by
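The Nyström setup quoted above (uniform selection of M ≪ m landmarks to approximate the kernel matrix) can be illustrated in standard RKHS. The reduced M-dimensional linear system below is one common closed form of Nyström KRR and is an assumption on our part, since the excerpt quotes only the setup, not the solution; the paper's actual version operates on data pairs in hyper-RKHS.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    """Gaussian kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma2))

def krr_nystrom(X, y, M, sigma2, lam, seed=0):
    """Nystrom-approximated KRR: uniformly sample M landmarks (M << m)
    and solve the reduced system
        (C^T C + lam * m * W) alpha = C^T y,
    where C = K(X, Z) is m x M and W = K(Z, Z) is M x M. The predictor
    only ever touches M-dimensional quantities."""
    m = len(X)
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(m, size=M, replace=False)]   # uniform landmark sampling
    C = gaussian_kernel(X, Z, sigma2)             # cross-kernel, m x M
    W = gaussian_kernel(Z, Z, sigma2)             # landmark kernel, M x M
    alpha = np.linalg.solve(C.T @ C + lam * m * W, C.T @ y)
    return lambda Xt: gaussian_kernel(Xt, Z, sigma2) @ alpha

# Toy usage: 30 landmarks out of 200 points on a smooth target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0])
f = krr_nystrom(X, y, M=30, sigma2=0.1, lam=1e-4)
pred = f(X)
```

The solve is M × M rather than m × m, which is what makes the scheme viable when m² hyper-kernel entries would otherwise be needed.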
Experiment Setup | Yes | During training, σ² in the Gaussian hyper-kernel is set to the variance of the data, and σ_h² is tuned via 5-fold cross validation over the values {0.25σ², 0.5σ², σ², 2σ², 4σ²}. The regularization parameters λ in KRR and C in SVR are searched on log10-scale grids in the range [10⁻⁵, 10⁵]. The two slack variables ξ̂_ij and ξ̌_ij in SVR are set to 0.1 and 0.01, respectively.
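The tuning protocol above (5-fold cross validation over a small multiplicative grid for the bandwidth and a log10 grid for the regularizer) can be sketched as a generic harness. The KRR fitter and the random fold assignment below are illustrative assumptions; only the fold count and the two grids come from the excerpt.

```python
import numpy as np

def krr_fit(Xtr, ytr, sigma2, lam):
    """Plain Gaussian-kernel KRR fitter used as a stand-in model."""
    sq = ((Xtr[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma2))
    alpha = np.linalg.solve(K + lam * len(Xtr) * np.eye(len(Xtr)), ytr)
    def predict(Xt):
        sqt = ((Xt[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
        return np.exp(-sqt / (2 * sigma2)) @ alpha
    return predict

def five_fold_cv(X, y, lam_grid, sigma2_grid, fit):
    """5-fold CV grid search: for each (sigma2, lambda) pair, average the
    held-out MSE over 5 random disjoint folds and keep the best pair."""
    m = len(X)
    folds = np.array_split(np.random.default_rng(0).permutation(m), 5)
    best, best_err = None, np.inf
    for s2 in sigma2_grid:
        for lam in lam_grid:
            errs = []
            for k in range(5):
                te = folds[k]
                tr = np.concatenate([folds[j] for j in range(5) if j != k])
                f = fit(X[tr], y[tr], s2, lam)
                errs.append(np.mean((f(X[te]) - y[te]) ** 2))
            if np.mean(errs) < best_err:
                best, best_err = (s2, lam), np.mean(errs)
    return best

# Grids matching the quoted setup (sigma^2 = data variance):
#   sigma2_grid = [0.25*s2, 0.5*s2, s2, 2*s2, 4*s2]  with s2 = X.var()
#   lam_grid    = 10.0 ** np.arange(-5, 6)           # [1e-5, ..., 1e5]
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (60, 1))
y = np.sin(3 * X[:, 0])
s2 = float(X.var())
best = five_fold_cv(X, y, lam_grid=[1e-3, 1e-1],
                    sigma2_grid=[0.5 * s2, s2], fit=krr_fit)
```

The full 5 × 11 grid from the setup plugs in the same way; the toy call above uses a reduced grid to keep the example fast.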