Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling
Authors: Shannon McCurdy
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). Our real-data illustration makes a strong case for the empirical usefulness of the DRLS algorithm and bounds. The real data exhibits striking power law decay of the ridge leverage scores (Figure 7), justifying the assumptions underlying the use of DRLS sampling (Theorem 5). |
| Researcher Affiliation | Academia | Shannon R. McCurdy, California Institute for Quantitative Biosciences, UC Berkeley, Berkeley, CA 94702, smccurdy@berkeley.edu |
| Pseudocode | Yes | Algorithm 1. The DRLS algorithm selects for the submatrix C all columns i with ridge leverage score τᵢ(A) above a threshold θ, determined by the error tolerance ϵ. This algorithm is deeply indebted to the deterministic algorithm of Papailiopoulos et al. (2014). It substitutes ridge leverage scores for rank-k subspace scores, and has a different stopping parameter. The algorithm is as follows. |
| Open Source Code | Yes | Software in the form of python and R code is available at https://github.com/srmcc/deterministic-ridge-leverage-sampling. |
| Open Datasets | Yes | We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). We download the data using the R tool TCGA2STAT (Wan et al., 2016). The data collection and data platforms are discussed in detail in the original paper (The Cancer Genome Atlas Research Network, 2015). |
| Dataset Splits | No | The paper does not explicitly specify a training, validation, and test split for the LGG multi-omic data. While it describes sample sizes (e.g., '274 tumor samples'), it does not detail how these samples were partitioned for training, validation, or testing purposes to enable reproducibility of data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions 'R tool TCGA2STAT' and 'CNtools (Zhang, 2015) that is imbedded in TCGA2STAT. R package version 1.26.0'. While it provides a version for CNtools, it does not list multiple key software components with their specific version numbers (e.g., versions for R, Python, or other libraries beyond one specific R package), which is required for a reproducible software description. |
| Experiment Setup | Yes | We choose k = 3 for the DRLS algorithm because these components are meaningful for the 'IDH' and 'codel' outcome variables (see Figures 3, 4, and 5). Applying the DRLS algorithm with k = 3, ϵ = 0.1 leads to \|Θ\| = 1512, selecting approximately 0.02% of the total multi-omic features for the column subset matrix C. We simulate 274 samples y according to the linear model (Eqn. 4), where y = Ax, the coefficients x ∼ N(0, I), and A is the LGG multi-omic feature matrix. We choose σ² = {10⁻³, 1, 10³}. |
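
The Pseudocode row describes DRLS only in words, so the following is a minimal Python sketch of the idea, not the author's implementation (that lives in the repository linked above). Two things here are assumptions rather than statements from the table: the ridge parameter is taken to be ‖A − A_k‖_F² / k, and the stopping rule keeps the smallest set of top-scoring columns whose scores sum to more than the total minus ϵ, with at least k columns kept. The function names `ridge_leverage_scores` and `drls_select` are hypothetical.

```python
import numpy as np


def ridge_leverage_scores(A, k):
    """Column ridge leverage scores of A (n samples x d features).

    Assumption: the ridge parameter is lam = ||A - A_k||_F^2 / k, so
    tau_i = a_i^T (A A^T + lam I)^+ a_i for column a_i of A.
    """
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    lam = np.sum(s[k:] ** 2) / k
    denom = s ** 2 + lam
    # Guard against 0/0 when A has rank <= k and some singular values vanish.
    weights = np.divide(s ** 2, denom, out=np.zeros_like(denom), where=denom > 0)
    # tau_i = sum_j V_ij^2 * s_j^2 / (s_j^2 + lam)
    return (Vt.T ** 2) @ weights


def drls_select(A, k, eps):
    """Deterministically pick column indices with the largest scores.

    Assumed stopping rule: keep the smallest prefix of the score-sorted
    columns whose cumulative score exceeds sum(tau) - eps, and never
    fewer than k columns.
    """
    tau = ridge_leverage_scores(A, k)
    order = np.argsort(-tau)              # descending by score
    csum = np.cumsum(tau[order])
    target = tau.sum() - eps
    # Index of the first cumulative sum strictly above target, as a count.
    m = int(np.searchsorted(csum, target, side="right")) + 1
    m = min(max(m, k), A.shape[1])
    theta = np.sort(order[:m])            # selected column indices
    return theta, tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for a feature matrix with decaying column scales.
    A = rng.standard_normal((20, 50)) * (np.arange(1, 51) ** -1.0)
    theta, tau = drls_select(A, k=3, eps=0.1)
    C = A[:, theta]                       # column subset matrix
    print(C.shape, tau.sum())
```

Because the selection is a fixed threshold on sorted scores, repeated runs on the same matrix return the same columns. The number of columns kept depends on how quickly the scores decay, which is why the power-law decay noted in the Research Type row matters.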