Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Guarding against Spurious Discoveries in High Dimensions

Authors: Jianqing Fan, Wen-Xin Zhou

JMLR 2016 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials. [...] First we ran a simulation study to examine how accurate the Gaussian approximation $R_0^2(s, p)$ is to the generalized likelihood ratio statistic $2\,\mathrm{LR}_n(s, p)$ in the null model. [...] In this section, we conduct a moderate scale simulation study to examine how effective the multiplier bootstrap quantile $q_{n,\alpha}(s, p)$ serves as a benchmark for judging whether the discovery is spurious. [...] In this section, we apply the idea of detecting spurious discoveries to the neuroblastoma data reported in Oberthuer et al. (2006).
Researcher Affiliation | Academia | Jianqing Fan EMAIL Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA. Wen-Xin Zhou EMAIL Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA.
Pseudocode | Yes | 2.3 An LAMM algorithm. The computation of the best subset regression coefficient $\hat{\beta}(s)$ in (4) requires solving a combinatorial optimization problem with a cardinality constraint, and therefore is NP-hard. In the following, we suggest a fast and easily implementable method, which combines the forward selection (stepwise addition) algorithm and a local adaptive majorization-minimization (LAMM) algorithm (Lange, Hunter and Yang, 2000; Fan et al., 2015) to provide an approximate solution. Our optimization problem is $\min_{\beta \in \mathbb{R}^p : \|\beta\|_0 \le s} f(\beta)$, where $f(\beta) = -L_n(\beta)$. We say that a function $g(\beta \mid \beta^{(k)})$ majorizes $f(\beta)$ at the point $\beta^{(k)}$ if $f(\beta^{(k)}) = g(\beta^{(k)} \mid \beta^{(k)})$ and $f(\beta) \le g(\beta \mid \beta^{(k)})$ for all $\beta \in \mathbb{R}^p$. A majorization-minimization (MM) algorithm initializes at $\beta^{(0)}$ and then iteratively computes $\beta^{(k+1)} = \arg\min_{\beta \in \mathbb{R}^p : \|\beta\|_0 \le s} g(\beta \mid \beta^{(k)})$. [...] We propose to use the stepwise forward selection algorithm to compute an initial estimator $\hat{\beta}^{(0)}$. [...] Starting from a prespecified value $\lambda = \lambda_0$, we successively inflate $\lambda$ by a factor $\rho > 1$. After the $\ell$th iteration, $\lambda = \lambda_\ell = \rho^{\ell - 1} \lambda_0$. We take the first $\ell$ such that $f(\hat{\beta}^{(k+1)}_{\lambda_\ell}) \le g_{\lambda_\ell}(\hat{\beta}^{(k+1)}_{\lambda_\ell} \mid \hat{\beta}^{(k)})$ and set $\hat{\beta}^{(k+1)} = \hat{\beta}^{(k+1)}_{\lambda_\ell}$.
Open Source Code | No | The paper does not provide explicit access information (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described. It refers to a previous paper for computational complexity analysis, and states that certain attempts are
Open Datasets | Yes | As an illustration, Fan, Shao and Zhou (2015) considered a real data example using the gene expression data from the international HapMap project (Thorisson et al., 2005). [...] To gain further insights, let us illustrate the issue by using the gene expression profiles for 10,707 genes from 251 patients in the German Neuroblastoma Trials NB90-NB2004 (Oberthuer et al., 2006).
Dataset Splits | Yes | We apply Lasso using the logistic regression model with tuning parameter selected via ten-fold cross validation (40 genes are selected). [...] The response labeled as 3-year event-free survival (3-year EFS) is a binary outcome indicating whether each patient survived 3 years after the diagnosis of neuroblastoma. Excluding five outlier arrays, there are 246 subjects (101 females and 145 males) with 3-year EFS information available. Among them, 56 are positives and 190 are negatives.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions the use of "Lasso using the logistic regression model" but does not specify any software names with version numbers for libraries or solvers used.
Experiment Setup | Yes | We apply Lasso using the logistic regression model with tuning parameter selected via ten-fold cross validation (40 genes are selected). [...] The results reported here are based on 200 simulations with the ambient dimension p = 400 and the sample size n taking values in {120, 160, 200}. [...] We take α = 0.1 and compute the empirical SDP based on 200 simulations. For each simulated data set, $q_{n,\alpha}(s, p)|_{s = \hat{s}_{\mathrm{cv}},\, p = 400}$ is computed based on 1000 bootstrap replications.
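
The LAMM iteration quoted in the Pseudocode row can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions: the majorizer is taken to be the isotropic quadratic g_λ(β | β^(k)) = f(β^(k)) + ∇f(β^(k))ᵀ(β − β^(k)) + (λ/2)‖β − β^(k)‖², whose minimizer under the cardinality constraint keeps the s largest-magnitude entries of a gradient step; f is the averaged logistic negative log-likelihood; the iteration starts at zero rather than at a forward-selection estimate; and the function names and the defaults `lam0`, `rho`, and `max_inflate` are hypothetical.

```python
import numpy as np


def neg_loglik(beta, X, y):
    """Averaged logistic negative log-likelihood, playing the role of f."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)


def grad(beta, X, y):
    """Gradient of neg_loglik with respect to beta."""
    z = X @ beta
    prob = 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable sigmoid
    return X.T @ (prob - y) / len(y)


def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out


def lamm_step(beta_k, X, y, s, lam0=1e-3, rho=2.0, max_inflate=60):
    """One MM update: inflate lam by rho until the majorization check holds."""
    f_k = neg_loglik(beta_k, X, y)
    g_k = grad(beta_k, X, y)
    lam = lam0
    for _ in range(max_inflate):
        # Minimizer of the assumed quadratic majorizer over s-sparse vectors.
        cand = hard_threshold(beta_k - g_k / lam, s)
        diff = cand - beta_k
        majorizer = f_k + g_k @ diff + 0.5 * lam * (diff @ diff)
        if neg_loglik(cand, X, y) <= majorizer:
            return cand
        lam *= rho
    return beta_k  # no acceptable lam found; leave the iterate unchanged


def lamm(X, y, s, n_iter=100, tol=1e-8):
    """Iterate lamm_step from a zero start (an assumption of this sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        new = lamm_step(beta, X, y, s)
        if np.linalg.norm(new - beta) <= tol:
            break
        beta = new
    return beta
```

Because each accepted candidate minimizes the majorizer over a feasible set that contains the current iterate, the objective cannot increase from one iteration to the next; the result is an approximate, not necessarily optimal, s-sparse solution.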