A Conditional Randomization Test for Sparse Logistic Regression in High-Dimension
Authors: Binh T. Nguyen, Bertrand Thirion, Sylvain Arlot
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a theoretical analysis of this procedure, and demonstrate its effectiveness on simulations, along with experiments on large-scale brain-imaging and genomics datasets. |
| Researcher Affiliation | Academia | Binh T. Nguyen: LTCI, Télécom Paris, IP Paris (tuanbinhs@gmail.com); Bertrand Thirion: Université Paris-Saclay, Inria, CEA, Palaiseau 91120, France (bertrand.thirion@inria.fr); Sylvain Arlot: Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d'Orsay, 91405, Orsay, France (sylvain.arlot@universite-paris-saclay.fr) |
| Pseudocode | Yes | A summary of the full procedure, which we call CRT-logit, can be found in Algorithm 1. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | The Human Connectome Project dataset (HCP) is a collection of brain imaging data... The last in our benchmark is a Genome-wide Association Study (GWAS) on The Cancer Genome Atlas (TCGA) dataset [30, 31]. [30] Suhas V Vasaikar, Peter Straub, Jing Wang, and Bing Zhang. LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic Acids Research, 46(D1):D956–D963, January 2018. [31] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The Cancer Genome Atlas pan-cancer analysis project. Nature Genetics, 45(10):1113–1120, 2013. |
| Dataset Splits | No | The paper describes the generation of synthetic data and the pre-processing of real datasets (e.g., clustering variables), and it recommends cross-validation for parameter tuning ('In general, we advise to use cross-validation for obtaining $\hat{\beta}^{\lambda}$ in Eq. (2) and for $X_{*,j}$-distillation operator, as defined by Eq. (10)'). However, it does not give train/validation/test split percentages, sample counts, or explicit instructions for reproducibly partitioning the datasets used in the experiments. |
| Hardware Specification | No | The paper states in its checklist: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'. The main text does not specify any particular GPU or CPU models, memory details, or cloud computing resources used for the experiments. |
| Software Dependencies | No | The paper cites software such as 'scikit-learn' in its bibliography [23], but it gives no version numbers for the libraries, frameworks, or environments used in its experiments; these would be needed for reproducibility. |
| Experiment Setup | Yes | Setting $\ell_1$-regularization parameter $\lambda$ and $\lambda_{dx}$: In general, we advise to use cross-validation for obtaining $\hat{\beta}^{\lambda}$ in Eq. (2) and for $X_{*,j}$-distillation operator, as defined by Eq. (10)... The true signal $\beta^0$ is picked with a sparsity parameter $\kappa = s/p$ that controls the proportion of non-zero elements with magnitude 2.0, i.e. $\beta^0_j = 2.0$ for all $j \in S$. For the specific purpose of this experiment, non-zero indices of $S$ are kept fixed. The noise $\xi$ is i.i.d. normal $\mathcal{N}(0, \mathrm{Id}_n)$ with magnitude $\sigma = \lVert X\beta^0 \rVert_2 / (\sqrt{n}\,\mathrm{SNR})$, controlled by the SNR parameter. In short, the three main parameters controlling this simulation are correlation $\rho$, sparsity degree $\kappa$ and signal-to-noise ratio SNR. Default parameters: $n = 400$, $p = 600$, $\mathrm{SNR} = 2.0$, $\rho = 0.5$, $\kappa = 0.04$. FDR is controlled at level $\alpha = 0.1$. |
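
The simulation described in the Experiment Setup row can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code. The excerpt does not say how the correlation ρ enters the design, so the sketch assumes an AR(1) structure with Cov(X_i, X_j) = ρ^|i−j|. It also assumes the support S is the first κ·p indices and that the noise scale is σ = ‖Xβ⁰‖₂ / (√n · SNR). The function name `simulate_design` is hypothetical. The excerpt does not specify how the binary label is derived, so the sketch stops at the noisy linear predictor.

```python
import numpy as np


def simulate_design(n=400, p=600, snr=2.0, rho=0.5, kappa=0.04, seed=0):
    """Draw one synthetic dataset with the default parameters quoted above."""
    rng = np.random.default_rng(seed)

    # Assumption: AR(1) columns, so that Cov(X[:, i], X[:, j]) = rho ** |i - j|.
    X = np.empty((n, p))
    X[:, 0] = rng.standard_normal(n)
    innovation_scale = np.sqrt(1.0 - rho ** 2)
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + innovation_scale * rng.standard_normal(n)

    # Sparse signal: a fraction kappa of coefficients equal to 2.0.
    # Assumption: the fixed support S is taken to be the first s indices.
    s = int(round(kappa * p))
    beta0 = np.zeros(p)
    beta0[:s] = 2.0

    # Noise magnitude set by the signal-to-noise ratio.
    signal = X @ beta0
    sigma = np.linalg.norm(signal) / (np.sqrt(n) * snr)

    # Noisy linear predictor; turning it into a binary response is left out
    # because the excerpt does not describe that step.
    z = signal + sigma * rng.standard_normal(n)
    return X, beta0, sigma, z


X, beta0, sigma, z = simulate_design()
print(X.shape, int(np.count_nonzero(beta0)))
```

With the defaults, κ·p = 0.04 × 600 gives 24 non-zero coefficients. Fixing `seed` keeps the design reproducible across runs, which matches the spirit of keeping the support fixed in the experiment.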