A Conditional Randomization Test for Sparse Logistic Regression in High-Dimension

Authors: Binh T. Nguyen, Bertrand Thirion, Sylvain Arlot

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a theoretical analysis of this procedure, and demonstrate its effectiveness on simulations, along with experiments on large-scale brain-imaging and genomics datasets.
Researcher Affiliation | Academia | Binh T. Nguyen (LTCI, Télécom Paris, IP Paris; tuanbinhs@gmail.com); Bertrand Thirion (Université Paris-Saclay, Inria, CEA, Palaiseau 91120, France; bertrand.thirion@inria.fr); Sylvain Arlot (Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d'Orsay, 91405, Orsay, France; sylvain.arlot@universite-paris-saclay.fr)
Pseudocode | Yes | A summary of the full procedure, which we call CRT-logit, can be found in Algorithm 1.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | The Human Connectome Project dataset (HCP) is a collection of brain imaging data... The last in our benchmark is a Genome-wide Association Study (GWAS) on The Cancer Genome Atlas (TCGA) dataset [30, 31]. [30] Suhas V Vasaikar, Peter Straub, Jing Wang, and Bing Zhang. LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic Acids Research, 46(D1):D956-D963, January 2018. [31] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10):1113-1120, 2013.
Dataset Splits | No | The paper describes the generation of synthetic data, pre-processing steps for real datasets (e.g., clustering variables), and refers to using cross-validation for parameter tuning (e.g., 'In general, we advise to use cross-validation for obtaining $\hat{\beta}^{\lambda}$ in Eq. (2) and for $X_{*,j}$-distillation operator, as defined by Eq. (10)'). However, it does not provide train/validation/test split percentages, sample counts, or explicit instructions for reproducible partitioning of the datasets used in the experiments.
Hardware Specification | No | The paper states in its checklist: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'. The main text does not specify any GPU or CPU models, memory details, or cloud computing resources used for the experiments.
Software Dependencies | No | The paper references software such as 'scikit-learn' in its bibliography [23], but it does not give version numbers for any of the software libraries, frameworks, or environments used in its experimental setup, which reproducibility requires.
Experiment Setup | Yes | Setting $\ell_1$-regularization parameter $\lambda$ and $\lambda_{dx}$: In general, we advise to use cross-validation for obtaining $\hat{\beta}^{\lambda}$ in Eq. (2) and for $X_{*,j}$-distillation operator, as defined by Eq. (10)... The true signal $\beta^0$ is picked with a sparsity parameter $\kappa = s^*/p$ that controls the proportion of non-zero elements with magnitude 2.0, i.e. $\beta_j = 2.0$ for all $j \in S$. For the specific purpose of this experiment, non-zero indices of $S$ are kept fixed. The noise $\xi$ is i.i.d. normal $\mathcal{N}(0, \mathrm{Id}_n)$ with magnitude $\sigma = \|X\beta^0\|_2 / (\sqrt{n}\,\mathrm{SNR})$, controlled by the SNR parameter. In short, the three main parameters controlling this simulation are correlation $\rho$, sparsity degree $\kappa$ and signal-to-noise ratio SNR. Default parameter: $n = 400$, $p = 600$, $\mathrm{SNR} = 2.0$, $\rho = 0.5$, $\kappa = 0.04$. FDR is controlled at level $\alpha = 0.1$.
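
The simulation parameters quoted in the Experiment Setup row can be turned into a small sketch, which may help when checking the defaults. This is an illustration under stated assumptions, not the authors' code. The function name `simulate_design` is hypothetical. The excerpt only says that ρ controls correlation, so the AR(1) construction below (which gives Cov(X_j, X_k) = ρ^|j-k|) is an assumption. Placing the support on the first indices is also an assumption, since the excerpt says only that the support is kept fixed. How the response is formed from the signal and the noise is not stated in the excerpt, so the sketch stops at the scaled noise.

```python
import math
import random


def simulate_design(n=400, p=600, snr=2.0, rho=0.5, kappa=0.04, seed=0):
    """Draw a design, a sparse signal, and scaled noise (illustrative sketch)."""
    rng = random.Random(seed)

    # Support size implied by the sparsity parameter: kappa * p.
    s = round(kappa * p)
    # ASSUMPTION: the fixed support is the first s indices.
    beta0 = [2.0 if j < s else 0.0 for j in range(p)]

    # ASSUMPTION: each row is a stationary AR(1) sequence with unit variance,
    # so that Cov(X_j, X_k) = rho ** abs(j - k).
    innovation_scale = math.sqrt(1.0 - rho ** 2)
    X = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            row.append(rho * row[-1] + innovation_scale * rng.gauss(0.0, 1.0))
        X.append(row)

    # Noise magnitude from the quoted formula:
    # sigma = ||X beta0||_2 / (sqrt(n) * SNR).
    signal = [sum(x * b for x, b in zip(row, beta0)) for row in X]
    signal_norm = math.sqrt(sum(v * v for v in signal))
    sigma = signal_norm / (math.sqrt(n) * snr)

    # Standard normal noise xi, scaled by sigma.
    noise = [sigma * rng.gauss(0.0, 1.0) for _ in range(n)]
    return X, beta0, sigma, noise
```

With the quoted defaults, κ · p = 0.04 × 600 gives a support of 24 non-zero coefficients. Calling `simulate_design()` returns the design as a list of 400 rows of length 600; a smaller call such as `simulate_design(n=50, p=100)` is enough to inspect the shapes and the value of σ.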