reproducibilityindex.ai

Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling

Authors: Shannon McCurdy

NeurIPS 2018

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). Our real-data illustration makes a strong case for the empirical usefulness of the DRLS algorithm and bounds. The real data exhibits striking power law decay of the ridge leverage scores (Figure 7), justifying the assumptions underlying the use of DRLS sampling (Theorem 5).
Researcher Affiliation	Academia	Shannon R. McCurdy, California Institute for Quantitative Biosciences, UC Berkeley, Berkeley, CA 94702, smccurdy@berkeley.edu
Pseudocode	Yes	Algorithm 1. The DRLS algorithm selects for the submatrix C all columns i with ridge leverage score τi(A) above a threshold θ, determined by the error tolerance ϵ. This algorithm is deeply indebted to the deterministic algorithm of Papailiopoulos et al. (2014). It substitutes ridge leverage scores for rank-k subspace scores, and has a different stopping parameter. The algorithm is as follows.
Open Source Code	Yes	Software in the form of Python and R code is available at https://github.com/srmcc/deterministic-ridge-leverage-sampling.
Open Datasets	Yes	We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). We download the data using the R tool TCGA2STAT (Wan et al., 2016). The data collection and data platforms are discussed in detail in the original paper (The Cancer Genome Atlas Research Network, 2015).
Dataset Splits	No	The paper does not explicitly specify a training, validation, and test split for the LGG multi-omic data. While it describes sample sizes (e.g., '274 tumor samples'), it does not detail how these samples were partitioned for training, validation, or testing purposes to enable reproducibility of data partitioning.
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used to conduct the experiments.
Software Dependencies	No	The paper mentions 'R tool TCGA2STAT' and 'CNtools (Zhang, 2015) that is imbedded in TCGA2STAT. R package version 1.26.0'. While it provides a version for CNtools, it does not list multiple key software components with their specific version numbers (e.g., versions for R, Python, or other libraries beyond one specific R package), which is required for a reproducible software description.
Experiment Setup	Yes	We choose k = 3 for the DRLS algorithm because these components are meaningful for the 'IDH' and 'codel' outcome variables (see Figures 3, 4, and 5). Applying the DRLS algorithm with k = 3, ϵ = 0.1 leads to |Θ| = 1512, selecting approximately 0.02% of the total multi-omic features for the column subset matrix C. We simulate 274 samples y according to the linear model (Eqn. 4), where y = Ax, the coefficients x ∼ N(0, I), and A is the LGG multi-omic feature matrix. We choose σ² = {10⁻³, 1, 10³}.
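
The Pseudocode row describes DRLS as keeping every column whose ridge leverage score exceeds a threshold θ set by the error tolerance ϵ. The following is a minimal Python sketch of one reading of that description, not the code from the author's repository. It rests on several assumptions that the table does not state: the regularizer is taken to be the rank-k residual energy divided by k, the stopping rule keeps the highest-scoring columns until the scores left out sum to less than ϵ (with at least k columns kept), and the function names `ridge_leverage_scores` and `drls_select` are hypothetical. The toy matrix is random data with decaying column scales, not the LGG multi-omic matrix.

```python
import numpy as np


def ridge_leverage_scores(A, k):
    """Column ridge leverage scores of A (n samples x d features).

    Assumes the regularizer lam = ||A - A_k||_F^2 / k, which must be > 0
    (i.e. A has rank greater than k).
    """
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    lam = np.sum(s[k:] ** 2) / k
    # tau_i = a_i^T (A A^T + lam I)^{-1} a_i, written in the SVD basis.
    weights = s ** 2 / (s ** 2 + lam)
    return (weights[:, None] * Vt ** 2).sum(axis=0)


def drls_select(A, k, eps):
    """Deterministically pick the columns with the largest scores.

    Assumed stopping rule: keep adding columns, largest score first,
    until the kept scores sum to more than (total - eps); keep at
    least k columns. Returns (selected indices, threshold theta, scores).
    """
    tau = ridge_leverage_scores(A, k)
    order = np.argsort(tau)[::-1]          # indices, largest score first
    cumulative = np.cumsum(tau[order])
    total = tau.sum()
    # First position where the running sum exceeds total - eps.
    count = int(np.searchsorted(cumulative, total - eps, side="right")) + 1
    count = min(max(count, k), A.shape[1])
    selected = order[:count]
    theta = tau[selected[-1]]              # smallest score that was kept
    return selected, theta, tau


# Toy data only: 20 samples, 200 features with power-law column scales.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 200)) * (np.arange(1, 201) ** -1.0)

selected, theta, tau = drls_select(A, k=3, eps=0.1)
C = A[:, selected]                         # column subset matrix
print(len(selected), theta)
```

Because the selection is a sort followed by a cutoff, repeated runs on the same matrix return the same column subset. How few columns are kept depends on how quickly the scores decay, which is the power-law behavior the Research Type row reports for the real data.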