Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling

Authors: Shannon McCurdy

NeurIPS 2018

Reproducibility Variable Result LLM Response
Research Type Experimental We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). Our real-data illustration makes a strong case for the empirical usefulness of the DRLS algorithm and bounds. The real data exhibits striking power law decay of the ridge leverage scores (Figure 7), justifying the assumptions underlying the use of DRLS sampling (Theorem 5).
Researcher Affiliation Academia Shannon R. McCurdy, California Institute for Quantitative Biosciences, UC Berkeley, Berkeley, CA 94702, smccurdy@berkeley.edu
Pseudocode Yes Algorithm 1. The DRLS algorithm selects for the submatrix C all columns i with ridge leverage score τᵢ(A) above a threshold θ, determined by the error tolerance ϵ. This algorithm is deeply indebted to the deterministic algorithm of Papailiopoulos et al. (2014). It substitutes ridge leverage scores for rank-k subspace scores, and has a different stopping parameter. The algorithm is as follows.
Open Source Code Yes Software in the form of Python and R code is available at https://github.com/srmcc/deterministic-ridge-leverage-sampling.
Open Datasets Yes We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). We download the data using the R tool TCGA2STAT (Wan et al., 2016). The data collection and data platforms are discussed in detail in the original paper (The Cancer Genome Atlas Research Network, 2015).
Dataset Splits No The paper does not explicitly specify a training, validation, and test split for the LGG multi-omic data. While it describes sample sizes (e.g., '274 tumor samples'), it does not detail how these samples were partitioned for training, validation, or testing purposes to enable reproducibility of data partitioning.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used to conduct the experiments.
Software Dependencies No The paper mentions 'R tool TCGA2STAT' and 'CNtools (Zhang, 2015) that is imbedded in TCGA2STAT. R package version 1.26.0'. While it provides a version for CNtools, it does not list multiple key software components with their specific version numbers (e.g., versions for R, Python, or other libraries beyond one specific R package), which is required for a reproducible software description.
Experiment Setup Yes We choose k = 3 for the DRLS algorithm because these components are meaningful for the 'IDH' and 'codel' outcome variables (see Figures 3, 4, and 5). Applying the DRLS algorithm with k = 3, ϵ = 0.1 leads to |Θ| = 1512, selecting approximately 0.02% of the total multi-omic features for the column subset matrix C. We simulate 274 samples y according to the linear model (Eqn. 4), where y = Ax, the coefficients x ∼ N(0, I), and A is the LGG multi-omic feature matrix. We choose σ² = {10⁻³, 1, 10³}.
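
The Pseudocode row describes the DRLS selection rule only in words, so a minimal Python sketch may help make it concrete. It rests on several assumptions that this summary does not state: the rank-k ridge leverage score is taken to be τᵢ(A) = aᵢᵀ(AAᵀ + λI)⁺aᵢ with λ = ‖A − A_k‖_F² / k, and the threshold θ is taken to be the smallest score kept when columns are added in decreasing score order until the discarded scores sum to less than ϵ. The function names `ridge_leverage_scores` and `drls_select` are hypothetical and are not taken from the author's repository, and the matrix below is synthetic, not the LGG data.

```python
import numpy as np


def ridge_leverage_scores(A, k):
    """Rank-k ridge leverage scores for the columns of A (shape n x d).

    Assumed definition: tau_i = a_i^T (A A^T + lam I)^+ a_i, with
    lam = ||A - A_k||_F^2 / k and A_k the best rank-k approximation of A.
    """
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    lam = np.sum(s[k:] ** 2) / k
    denom = s ** 2 + lam
    # Pseudo-inverse convention: directions with a zero denominator contribute 0.
    weights = np.divide(s ** 2, denom, out=np.zeros_like(denom), where=denom > 0)
    # With A = U diag(s) V^T, tau_i = sum_j weights[j] * Vt[j, i]^2.
    return weights @ (Vt ** 2)


def drls_select(tau, eps, k=None):
    """Keep the highest-scoring columns until the discarded scores sum to < eps.

    Returns the selected column indices (in decreasing score order) and the
    implied threshold theta. The stopping rule is an assumption of this sketch.
    """
    order = np.argsort(-tau, kind="stable")
    cums = np.cumsum(tau[order])
    tail = cums[-1] - cums  # score mass discarded after keeping 1, 2, ... columns
    m = int(np.argmax(tail < eps)) + 1
    if k is not None:
        m = min(max(m, k), len(tau))  # assumed safeguard: keep at least k columns
    selected = order[:m]
    theta = tau[selected[-1]]
    return selected, theta


# Synthetic stand-in with decaying column scales (not the LGG multi-omic matrix).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 200)) * (np.arange(1, 201) ** -1.0)

tau = ridge_leverage_scores(A, k=3)
cols, theta = drls_select(tau, eps=0.1, k=3)
C = A[:, cols]  # column subset matrix
print(len(cols), theta)
```

Sorting once and scanning the cumulative sum keeps the selection deterministic, which is the point of the method; tied scores at the threshold are not treated specially in this sketch.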