Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Guarding against Spurious Discoveries in High Dimensions
Authors: Jianqing Fan, Wen-Xin Zhou
JMLR 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials. [...] First we ran a simulation study to examine how accurate the Gaussian approximation $R_0^2(s, p)$ is to the generalized likelihood ratio statistic $2\mathrm{LR}_n(s, p)$ in the null model. [...] In this section, we conduct a moderate scale simulation study to examine how effective the multiplier bootstrap quantile $q_{n,\alpha}(s, p)$ serves as a benchmark for judging whether the discovery is spurious. [...] In this section, we apply the idea of detecting spurious discoveries to the neuroblastoma data reported in Oberthuer et al. (2006). |
| Researcher Affiliation | Academia | Jianqing Fan EMAIL Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA. Wen-Xin Zhou EMAIL Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA. |
| Pseudocode | Yes | 2.3 An LAMM algorithm The computation of the best subset regression coefficient $\widehat{\beta}(s)$ in (4) requires solving a combinatorial optimization problem with a cardinality constraint, and therefore is NP-hard. In the following, we suggest a fast and easily implementable method, which combines the forward selection (stepwise addition) algorithm and a local adaptive majorization-minimization (LAMM) algorithm (Lange, Hunter and Yang, 2000; Fan et al., 2015) to provide an approximate solution. Our optimization problem is $\min_{\beta \in \mathbb{R}^p : \|\beta\|_0 \le s} f(\beta)$, where $f(\beta) = L_n(\beta)$. We say that a function $g(\beta \mid \beta^{(k)})$ majorizes $f(\beta)$ at the point $\beta^{(k)}$ if $f(\beta^{(k)}) = g(\beta^{(k)} \mid \beta^{(k)})$ and $f(\beta) \le g(\beta \mid \beta^{(k)})$ for all $\beta \in \mathbb{R}^p$. A majorization-minimization (MM) algorithm initializes at $\beta^{(0)}$ and then iteratively computes $\beta^{(k+1)} = \operatorname{argmin}_{\beta \in \mathbb{R}^p : \|\beta\|_0 \le s} g(\beta \mid \beta^{(k)})$. [...] We propose to use the stepwise forward selection algorithm to compute an initial estimator $\widehat{\beta}^{(0)}$. [...] Starting from a prespecified value $\lambda = \lambda_0$, we successfully inflate $\lambda$ by a factor $\rho > 1$. After the $\ell$th iteration, $\lambda = \lambda_\ell = \rho^{\ell-1}\lambda_0$. We take the first $\ell$ such that $f(\widehat{\beta}^{(k+1)}_{\lambda_\ell}) \le g_{\lambda_\ell}(\widehat{\beta}^{(k+1)}_{\lambda_\ell} \mid \widehat{\beta}^{(k)})$ and set $\widehat{\beta}^{(k+1)} = \widehat{\beta}^{(k+1)}_{\lambda_\ell}$. |
| Open Source Code | No | The paper does not provide explicit access information (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described. It refers to a previous paper for computational complexity analysis, and states that certain attempts are |
| Open Datasets | Yes | As an illustration, Fan, Shao and Zhou (2015) considered a real data example using the gene expression data from the international HapMap project (Thorisson et al., 2005). [...] To gain further insights, let us illustrate the issue by using the gene expression profiles for 10,707 genes from 251 patients in the German Neuroblastoma Trials NB90-NB2004 (Oberthuer et al., 2006). |
| Dataset Splits | Yes | We apply Lasso using the logistic regression model with tuning parameter selected via ten-fold cross validation (40 genes are selected). [...] The response labeled as 3-year event-free survival (3-year EFS) is a binary outcome indicating whether each patient survived 3 years after the diagnosis of neuroblastoma. Excluding five outlier arrays, there are 246 subjects (101 females and 145 males) with 3-year EFS information available. Among them, 56 are positives and 190 are negatives. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions the use of "Lasso using the logistic regression model" but does not specify any software names with version numbers for libraries or solvers used. |
| Experiment Setup | Yes | For the logistic regression model with tuning parameter selected via ten-fold cross validation (40 genes are selected). [...] The results reported here are based on 200 simulations with the ambient dimension p = 400 and the sample size n taken values in {120, 160, 200}. [...] We take α = 0.1 and compute the empirical SDP based on 200 simulations. For each simulated data set, $q_{n,\alpha}(s, p)\vert_{s=\widehat{s}_{\mathrm{cv}},\, p=400}$ is computed based on 1000 bootstrap replications. |
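
The λ-inflation rule quoted in the Pseudocode row can be illustrated with a small sketch. This is not the authors' code, and several pieces are assumptions made only for illustration: a least-squares loss stands in for $f(\beta) = L_n(\beta)$, the majorizer is taken to be the isotropic quadratic $g_\lambda(\beta \mid \beta^{(k)}) = f(\beta^{(k)}) + \langle \nabla f(\beta^{(k)}), \beta - \beta^{(k)} \rangle + \tfrac{\lambda}{2}\|\beta - \beta^{(k)}\|_2^2$ (whose minimizer under the constraint $\|\beta\|_0 \le s$ is a hard-thresholded gradient step), and the iteration starts from zero rather than from forward selection. The function names, default values of `lam0` and `rho`, and the toy data generator are all hypothetical.

```python
import random


def loss(X, y, beta):
    """Least-squares loss f(beta) = (1 / 2n) * sum_i (y_i - x_i . beta)^2 (assumed stand-in for L_n)."""
    n = len(y)
    return sum((yi - sum(a * b for a, b in zip(xi, beta))) ** 2
               for xi, yi in zip(X, y)) / (2 * n)


def grad(X, y, beta):
    """Gradient of the least-squares loss at beta."""
    n = len(y)
    resid = [sum(a * b for a, b in zip(xi, beta)) - yi for xi, yi in zip(X, y)]
    return [sum(X[i][j] * resid[i] for i in range(n)) / n
            for j in range(len(beta))]


def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and set the rest to zero."""
    keep = set(sorted(range(len(v)), key=lambda j: -abs(v[j]))[:s])
    return [v[j] if j in keep else 0.0 for j in range(len(v))]


def lamm_step(X, y, beta, s, lam0=1e-3, rho=2.0, max_inflations=60):
    """One MM update: inflate lam by rho until the quadratic majorizes f at the candidate."""
    f_old, g = loss(X, y, beta), grad(X, y, beta)
    lam = lam0
    for _ in range(max_inflations):
        # Minimizer of the assumed isotropic majorizer over s-sparse vectors.
        cand = hard_threshold([b - gj / lam for b, gj in zip(beta, g)], s)
        diff = [c - b for c, b in zip(cand, beta)]
        majorizer = (f_old + sum(gj * d for gj, d in zip(g, diff))
                     + 0.5 * lam * sum(d * d for d in diff))
        # Accept the first lam with f(cand) <= g_lam(cand | beta).
        if loss(X, y, cand) <= majorizer + 1e-12:
            return cand, lam
        lam *= rho
    raise RuntimeError("lam inflation did not produce a valid majorizer")


def lamm_fit(X, y, s, n_iter=50):
    """Run n_iter updates from beta = 0 (a simplification; the paper uses forward selection)."""
    beta = [0.0] * len(X[0])
    history = [loss(X, y, beta)]
    for _ in range(n_iter):
        beta, _lam = lamm_step(X, y, beta, s)
        history.append(loss(X, y, beta))
    return beta, history


def make_demo_data(seed=0, n=30, p=6):
    """Hypothetical toy data: y depends on columns 0 and 3 plus small noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * x[0] - 1.5 * x[3] + 0.1 * rng.gauss(0, 1) for x in X]
    return X, y
```

Calling `beta, history = lamm_fit(*make_demo_data(), s=2)` returns an estimate with at most two nonzero entries. Because each accepted candidate satisfies the majorization check and the previous iterate is itself feasible, the recorded loss values in `history` should not increase from one iteration to the next.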