Discovering Conditionally Salient Features with Statistical Guarantees

Authors: Jaime Roquero Gimenez, James Zou

ICML 2019

Reproducibility assessment — each entry gives the variable, the result, and the LLM's supporting response:
Research Type: Experimental. "We implement this method and present an algorithm that automatically partitions the feature space such that it enhances the differences between selected sets in different regions, and validate the statistical theoretical results with experiments." (Abstract); "Our main contributions of this paper are in laying out the new framework for conditional feature selection along with proposing a new knockoff algorithm with mathematical guarantees. We also validate the algorithm on experiments." (Our Contributions); "We now run experiments and the first goal is to show that our main theorem holds." (Section 4)
Researcher Affiliation: Academia. "1 Department of Statistics, Stanford University, Stanford, USA; 2 Department of Biomedical Data Science, Stanford University, Stanford, USA."
Pseudocode: Yes. "Algorithm 1: Knockoffs with Local Importance Scores Feature Selection Procedure" (Section 3.2); "Algorithm 2: One-Step Greedy Feature Space Partition" (Section 3.3)
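For context on what a knockoff-based selection procedure like Algorithm 1 involves, here is a minimal sketch of the generic knockoff+ selection rule of Barber and Candès, which such procedures build on. It assumes feature statistics `W` (one per feature, large positive values indicating evidence of importance) have already been computed; it is not the authors' exact local procedure.

```python
import numpy as np

def knockoff_select(W, q=0.2):
    """Generic knockoff+ selection rule: find the smallest threshold t
    whose estimated false discovery proportion, (1 + #{W_j <= -t}) /
    max(1, #{W_j >= t}), is at most the target FDR level q, then select
    all features with W_j >= t."""
    for t in np.sort(np.abs(W[W != 0])):  # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)  # indices of selected features
    return np.array([], dtype=int)  # no threshold passes: select nothing
```

For example, `knockoff_select(np.array([4.0, 3.5, 3.0, 2.8, 2.6, 2.4, -0.5, 0.3, -0.2, 2.2]), q=0.2)` selects the strongly positive statistics while the sign-flipped ones calibrate the threshold.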
Open Source Code: No. The paper does not contain any explicit statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets: No. "We therefore sample the datasets X, X̃ so that X simulates the SNPs of a cohort of patients, i.e. a matrix of 0, 1, and 2." (Section 4) The paper describes a synthetic data generation process but does not indicate that the dataset used for experiments is publicly available, nor does it provide access information (link, DOI, citation) for a specific public dataset.
Dataset Splits: No. "We vary the size of the dataset to show that our method still controls local FDR even though the number of points in a given subregion is very limited." (Section 4) The paper mentions varying the dataset size but does not specify train/validation/test splits, absolute sample counts for each split, or predefined splits with citations.
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies: No. The paper mentions using logistic regression as part of the importance score calculation, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, scikit-learn versions).
Experiment Setup: No. "Our target FDR is q = 0.2 and locally we consider as importance scores the absolute values of the coefficients in a logistic regression." (Section 4) The paper provides some high-level experimental parameters, such as the target FDR level and the type of importance scores used, but it lacks specific hyperparameters (e.g., learning rate, batch size, number of epochs) and system-level training settings needed for reproducibility.
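To illustrate the quoted setup, here is a hypothetical sketch of how importance scores of that kind are typically formed in knockoff methods: fit a logistic regression on the augmented matrix [X, X̃] and contrast each coefficient's absolute value with that of its knockoff copy. The gradient-descent fit, step counts, and learning rate below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def knockoff_stats(X, Xk, y, steps=3000, lr=0.5):
    """Fit plain logistic regression by gradient descent on [X, X_knockoff]
    and return W_j = |beta_j| - |beta_knockoff_j| (absolute coefficients
    as importance scores). Large positive W_j suggests real signal."""
    XA = np.hstack([X, Xk])           # augmented design matrix [X, X~]
    n, d = XA.shape
    beta = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(XA @ beta)))   # predicted probabilities
        beta -= lr * (XA.T @ (p - y)) / n        # averaged gradient step
    z = np.abs(beta)                   # importance score per column
    m = X.shape[1]
    return z[:m] - z[m:]               # feature score minus knockoff score
```

On synthetic data where only the first feature drives the response and the knockoffs are pure noise, W for that feature comes out clearly positive while the null features' statistics stay near zero, which is what the selection rule at level q = 0.2 exploits.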