Fast Randomized Kernel Ridge Regression with Statistical Guarantees

Authors: Ahmed Alaoui, Michael W. Mahoney

NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We give an empirical evidence supporting this fact. Our second contribution is to present a fast algorithm to quickly compute coarse approximations to these scores in time linear in the number of samples. We test our results based on several datasets: one synthetic regression problem from [3] to illustrate the importance of the λ-ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh 5 and the Gas Sensor Array Drift Dataset from the UCI database6.
Researcher Affiliation Academia Ahmed El Alaoui Michael W. Mahoney Electrical Engineering and Computer Sciences Statistics and International Computer Science Institute University of California, Berkeley, Berkeley, CA 94720. {elalaoui@eecs,mmahoney@stat}.berkeley.edu
Pseudocode Yes Inputs: data points (xi)1 i n, probability vector (pi)1 i n, sampling parameter p {1, 2, }, λ > 0, ϵ (0, 1/2). Output: ( li)1 i n ϵ-approximations to (li(λ))1 i n. 1. Sample p data points from (xi)1 i n with replacement with probabilities (pi)1 i n. 2. Compute the corresponding columns K1, , Kp of the kernel matrix. 3. Construct C = [K1, , Kp] Rn p and W Rp p as presented in Section 2. 4. Construct B Rn p such that BB = CW C . 5. For every i {1, , n}, set li = B i (B B + nλI) 1Bi (9) where Bi is the i-th row of B, and return it.
Open Source Code No The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets Yes We test our results based on several datasets: one synthetic regression problem from [3] to illustrate the importance of the λ-ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh 5 and the Gas Sensor Array Drift Dataset from the UCI database6. (5http://www.cs.toronto.edu/ delve/data/pumadyn/desc.html 6https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset)
Dataset Splits Yes For all datasets, we determine λ and the band width of k by cross validation, and we compute the effective dimensionality deff and the maximal degrees of freedom dmof.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies No The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup Yes For all datasets, we determine λ and the band width of k by cross validation, and we compute the effective dimensionality deff and the maximal degrees of freedom dmof. Table 1 summarizes the experiments. It is often the case that deff dmof and R( ˆf L)/R( ˆf K) 1, in agreement with Theorem 3. The synthetic case consists of a regression problem on the interval X = [0, 1] where, given a sequence (xi)1 i n and a sequence of noise (ϵi)1 i n, we observe the sequence yi = f(xi) + σ2ϵi, i {1, , n}. The number of observations is n = 500.