reproducibilityindex.ai

Fast Randomized Kernel Ridge Regression with Statistical Guarantees

Authors: Ahmed Alaoui, Michael W. Mahoney

NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We give an empirical evidence supporting this fact. Our second contribution is to present a fast algorithm to quickly compute coarse approximations to these scores in time linear in the number of samples. We test our results based on several datasets: one synthetic regression problem from [3] to illustrate the importance of the λ-ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh 5 and the Gas Sensor Array Drift Dataset from the UCI database6.
Researcher Affiliation	Academia	Ahmed El Alaoui Michael W. Mahoney Electrical Engineering and Computer Sciences Statistics and International Computer Science Institute University of California, Berkeley, Berkeley, CA 94720. {elalaoui@eecs,mmahoney@stat}.berkeley.edu
Pseudocode	Yes	Inputs: data points (xi)1 i n, probability vector (pi)1 i n, sampling parameter p {1, 2, }, λ > 0, ϵ (0, 1/2). Output: ( li)1 i n ϵ-approximations to (li(λ))1 i n. 1. Sample p data points from (xi)1 i n with replacement with probabilities (pi)1 i n. 2. Compute the corresponding columns K1, , Kp of the kernel matrix. 3. Construct C = [K1, , Kp] Rn p and W Rp p as presented in Section 2. 4. Construct B Rn p such that BB = CW C . 5. For every i {1, , n}, set li = B i (B B + nλI) 1Bi (9) where Bi is the i-th row of B, and return it.
Open Source Code	No	The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets	Yes	We test our results based on several datasets: one synthetic regression problem from [3] to illustrate the importance of the λ-ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh 5 and the Gas Sensor Array Drift Dataset from the UCI database6. (5http://www.cs.toronto.edu/ delve/data/pumadyn/desc.html 6https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset)
Dataset Splits	Yes	For all datasets, we determine λ and the band width of k by cross validation, and we compute the effective dimensionality deff and the maximal degrees of freedom dmof.
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies	No	The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup	Yes	For all datasets, we determine λ and the band width of k by cross validation, and we compute the effective dimensionality deff and the maximal degrees of freedom dmof. Table 1 summarizes the experiments. It is often the case that deff dmof and R( ˆf L)/R( ˆf K) 1, in agreement with Theorem 3. The synthetic case consists of a regression problem on the interval X = [0, 1] where, given a sequence (xi)1 i n and a sequence of noise (ϵi)1 i n, we observe the sequence yi = f(xi) + σ2ϵi, i {1, , n}. The number of observations is n = 500.