Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fast Randomized Kernel Ridge Regression with Statistical Guarantees
Authors: Ahmed Alaoui, Michael W. Mahoney
NeurIPS 2015 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We give an empirical evidence supporting this fact. Our second contribution is to present a fast algorithm to quickly compute coarse approximations to these scores in time linear in the number of samples. We test our results based on several datasets: one synthetic regression problem from [3] to illustrate the importance of the λ-ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh⁵ and the Gas Sensor Array Drift Dataset from the UCI database⁶. |
| Researcher Affiliation | Academia | Ahmed El Alaoui (Electrical Engineering and Computer Sciences) and Michael W. Mahoney (Statistics and International Computer Science Institute), University of California, Berkeley, Berkeley, CA 94720. {elalaoui@eecs,mmahoney@stat}.berkeley.edu |
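| Pseudocode | Yes | Inputs: data points $(x_i)_{1 \le i \le n}$, probability vector $(p_i)_{1 \le i \le n}$, sampling parameter $p \in \{1, 2, \dots\}$, $\lambda > 0$, $\epsilon \in (0, 1/2)$. Output: $(\tilde{l}_i)_{1 \le i \le n}$, $\epsilon$-approximations to $(l_i(\lambda))_{1 \le i \le n}$. 1. Sample $p$ data points from $(x_i)_{1 \le i \le n}$ with replacement with probabilities $(p_i)_{1 \le i \le n}$. 2. Compute the corresponding columns $K_1, \dots, K_p$ of the kernel matrix. 3. Construct $C = [K_1, \dots, K_p] \in \mathbb{R}^{n \times p}$ and $W \in \mathbb{R}^{p \times p}$ as presented in Section 2. 4. Construct $B \in \mathbb{R}^{n \times p}$ such that $BB^\top = C W^\dagger C^\top$. 5. For every $i \in \{1, \dots, n\}$, set $\tilde{l}_i = B_i^\top (B^\top B + n\lambda I)^{-1} B_i$ (9), where $B_i$ is the $i$-th row of $B$, and return it. |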
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We test our results based on several datasets: one synthetic regression problem from [3] to illustrate the importance of the λ-ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh⁵ and the Gas Sensor Array Drift Dataset from the UCI database⁶. (⁵ http://www.cs.toronto.edu/~delve/data/pumadyn/desc.html; ⁶ https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset) |
| Dataset Splits | Yes | For all datasets, we determine λ and the band width of k by cross validation, and we compute the effective dimensionality $d_{\text{eff}}$ and the maximal degrees of freedom $d_{\text{mof}}$. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | For all datasets, we determine λ and the band width of k by cross validation, and we compute the effective dimensionality $d_{\text{eff}}$ and the maximal degrees of freedom $d_{\text{mof}}$. Table 1 summarizes the experiments. It is often the case that $d_{\text{eff}} \ll d_{\text{mof}}$ and $R(\hat{f}_L)/R(\hat{f}_K) \approx 1$, in agreement with Theorem 3. The synthetic case consists of a regression problem on the interval $X = [0, 1]$ where, given a sequence $(x_i)_{1 \le i \le n}$ and a sequence of noise $(\epsilon_i)_{1 \le i \le n}$, we observe the sequence $y_i = f(x_i) + \sigma^2 \epsilon_i$, $i \in \{1, \dots, n\}$. The number of observations is n = 500. |
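
The five steps quoted in the Pseudocode row can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code (the paper releases none). The Gaussian kernel, the `bandwidth` parameter, the function names, and the eigenvalue cutoff are assumptions made for the example. The sketch builds `W` as the plain sampled block of the kernel matrix, whereas the paper defines `C` and `W` in its Section 2, which is not reproduced here and may include a rescaling.

```python
import numpy as np


def gaussian_kernel(X, Y, bandwidth):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y (assumed kernel)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def approx_ridge_leverage_scores(X, probs, p, lam, bandwidth, rng):
    """Approximate lambda-ridge leverage scores from p sampled kernel columns."""
    n = X.shape[0]

    # Step 1: sample p data points with replacement according to probs.
    idx = rng.choice(n, size=p, replace=True, p=probs)

    # Steps 2-3: sampled kernel columns C (n x p) and their p x p block W.
    C = gaussian_kernel(X, X[idx], bandwidth)
    W = C[idx, :]
    W = 0.5 * (W + W.T)  # remove round-off asymmetry

    # Step 4: B = C W^{+1/2}, so that B B^T = C W^+ C^T.
    # Near-zero eigenvalues (e.g. from repeated samples) are dropped,
    # which is how the pseudo-inverse is taken; B may have fewer than p columns.
    evals, evecs = np.linalg.eigh(W)
    keep = evals > 1e-10 * evals.max()
    B = C @ (evecs[:, keep] / np.sqrt(evals[keep]))

    # Step 5: score_i = B_i^T (B^T B + n*lam*I)^{-1} B_i for every row B_i.
    G = B.T @ B + n * lam * np.eye(B.shape[1])
    return np.sum(B * np.linalg.solve(G, B.T).T, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(500, 1))          # n = 500 points on [0, 1]
    uniform = np.full(500, 1.0 / 500)       # uniform sampling probabilities
    scores = approx_ridge_leverage_scores(X, uniform, 50, 1e-3, 0.1, rng)
    print(scores[:5])
```

Only an n x p slice of the kernel matrix is ever formed, and the linear solve is at most p x p, so the cost grows linearly with the number of samples for fixed p. Uniform probabilities are used in the usage example purely for illustration.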