Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Guarding against Spurious Discoveries in High Dimensions
Authors: Jianqing Fan, Wen-Xin Zhou
JMLR 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials. [...] First we ran a simulation study to examine how accurate the Gaussian approximation $R_0^2(s, p)$ is to the generalized likelihood ratio statistic $2\mathrm{LR}_n(s, p)$ in the null model. [...] In this section, we conduct a moderate scale simulation study to examine how effective the multiplier bootstrap quantile $q_{n,\alpha}(s, p)$ serves as a benchmark for judging whether the discovery is spurious. [...] In this section, we apply the idea of detecting spurious discoveries to the neuroblastoma data reported in Oberthuer et al. (2006). |
| Researcher Affiliation | Academia | Jianqing Fan EMAIL Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA. Wen-Xin Zhou EMAIL Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA. |
| Pseudocode | Yes | 2.3 An LAMM algorithm The computation of the best subset regression coefficient $\hat{\beta}(s)$ in (4) requires solving a combinatorial optimization problem with a cardinality constraint, and therefore is NP-hard. In the following, we suggest a fast and easily implementable method, which combines the forward selection (stepwise addition) algorithm and a local adaptive majorization-minimization (LAMM) algorithm (Lange, Hunter and Yang, 2000; Fan et al., 2015) to provide an approximate solution. Our optimization problem is $\min_{\beta \in \mathbb{R}^p:\, \lVert\beta\rVert_0 \le s} f(\beta)$, where $f(\beta) = L_n(\beta)$. We say that a function $g(\beta \mid \beta^{(k)})$ majorizes $f(\beta)$ at the point $\beta^{(k)}$ if $f(\beta^{(k)}) = g(\beta^{(k)} \mid \beta^{(k)})$ and $f(\beta) \le g(\beta \mid \beta^{(k)})$ for all $\beta \in \mathbb{R}^p$. A majorization-minimization (MM) algorithm initializes at $\beta^{(0)}$ and then iteratively computes $\beta^{(k+1)} = \operatorname{argmin}_{\beta \in \mathbb{R}^p:\, \lVert\beta\rVert_0 \le s} g(\beta \mid \beta^{(k)})$. [...] We propose to use the stepwise forward selection algorithm to compute an initial estimator $\hat{\beta}^{(0)}$. [...] Starting from a prespecified value $\lambda = \lambda_0$, we successively inflate $\lambda$ by a factor $\rho > 1$. After the $\ell$th iteration, $\lambda = \lambda_\ell = \rho^{\ell - 1} \lambda_0$. We take the first $\ell$ such that $f(\hat{\beta}^{(k+1)}_{\lambda_\ell}) \le g_{\lambda_\ell}(\hat{\beta}^{(k+1)}_{\lambda_\ell} \mid \hat{\beta}^{(k)})$ and set $\hat{\beta}^{(k+1)} = \hat{\beta}^{(k+1)}_{\lambda_\ell}$. |
| Open Source Code | No | The paper does not provide explicit access information (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described. It refers to a previous paper for computational complexity analysis, and states that certain attempts are |
| Open Datasets | Yes | As an illustration, Fan, Shao and Zhou (2015) considered a real data example using the gene expression data from the international HapMap project (Thorisson et al., 2005). [...] To gain further insights, let us illustrate the issue by using the gene expression profiles for 10,707 genes from 251 patients in the German Neuroblastoma Trials NB90-NB2004 (Oberthuer et al., 2006). |
| Dataset Splits | Yes | We apply Lasso using the logistic regression model with tuning parameter selected via ten-fold cross validation (40 genes are selected). [...] The response labeled as 3-year event-free survival (3-year EFS) is a binary outcome indicating whether each patient survived 3 years after the diagnosis of neuroblastoma. Excluding five outlier arrays, there are 246 subjects (101 females and 145 males) with 3-year EFS information available. Among them, 56 are positives and 190 are negatives. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions the use of "Lasso using the logistic regression model" but does not specify any software names with version numbers for libraries or solvers used. |
| Experiment Setup | Yes | For the logistic regression model with tuning parameter selected via ten-fold cross validation (40 genes are selected). [...] The results reported here are based on 200 simulations with the ambient dimension $p = 400$ and the sample size $n$ taken values in $\{120, 160, 200\}$. [...] We take $\alpha = 0.1$ and compute the empirical SDP based on 200 simulations. For each simulated data set, $q_{n,\alpha}(s, p)\vert_{s=\hat{s}_{\mathrm{cv}},\, p=400}$ is computed based on 1000 bootstrap replications. |
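
The LAMM iteration quoted in the Pseudocode row can be sketched in a few lines. This is a minimal illustration, not the authors' code (the table records no code release). Three things here are assumptions that the excerpt does not state: the majorizer is taken to be the isotropic quadratic $g_\lambda(\beta \mid \beta^{(k)}) = f(\beta^{(k)}) + \nabla f(\beta^{(k)})^\top(\beta - \beta^{(k)}) + \tfrac{\lambda}{2}\lVert\beta - \beta^{(k)}\rVert_2^2$, whose constrained minimizer is a hard-thresholded gradient step; the start point defaults to zero rather than the forward-selection estimator; and all function names and the defaults `lam0`, `rho`, `max_iter`, `tol` are hypothetical.

```python
import numpy as np


def logistic_loss(beta, X, y):
    """Average negative log-likelihood L_n(beta) for labels y in {0, 1}."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)


def logistic_grad(beta, X, y):
    """Gradient of logistic_loss with respect to beta."""
    z = X @ beta
    prob = 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable sigmoid
    return X.T @ (prob - y) / len(y)


def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if s > 0:
        keep = np.argpartition(np.abs(v), -s)[-s:]
        out[keep] = v[keep]
    return out


def lamm_best_subset(X, y, s, beta0=None, lam0=1.0, rho=2.0,
                     max_iter=200, tol=1e-8):
    """Approximate best-subset logistic regression via LAMM (sketch).

    Assumption: isotropic quadratic majorizer, so each MM step is a
    hard-thresholded gradient step with curvature parameter lam.
    """
    p = X.shape[1]
    if beta0 is None:
        beta = np.zeros(p)  # assumption: zero start, not forward selection
    else:
        beta = hard_threshold(np.asarray(beta0, dtype=float), s)
    for _ in range(max_iter):
        f_old = logistic_loss(beta, X, y)
        grad = logistic_grad(beta, X, y)
        lam = lam0
        while True:
            # Minimizer of g_lam(. | beta) over s-sparse vectors.
            cand = hard_threshold(beta - grad / lam, s)
            diff = cand - beta
            g_val = f_old + grad @ diff + 0.5 * lam * (diff @ diff)
            # Accept the first lam for which the majorization holds.
            if logistic_loss(cand, X, y) <= g_val + 1e-12:
                break
            lam *= rho  # inflate lam by the factor rho > 1
        beta = cand
        if f_old - logistic_loss(beta, X, y) < tol:
            break
    return beta
```

The inner loop terminates because the logistic loss has a Lipschitz-continuous gradient, so the quadratic bound holds once `lam` is large enough. Restarting from `lam0` at every outer iteration is what makes the scheme locally adaptive: the step size is only as conservative as the current iterate requires.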