Discovering Conditionally Salient Features with Statistical Guarantees
Authors: Jaime Roquero Gimenez, James Zou
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement this method and present an algorithm that automatically partitions the feature space such that it enhances the differences between selected sets in different regions, and validate the statistical theoretical results with experiments. (Abstract) Our main contributions of this paper are in laying out the new framework for conditional feature selection along with proposing a new knockoff algorithm with mathematical guarantees. We also validate the algorithm on experiments. (Our Contributions) We now run experiments and the first goal is to show that our main theorem holds. (Section 4) |
| Researcher Affiliation | Academia | Department of Statistics, Stanford University, Stanford, USA; Department of Biomedical Data Science, Stanford University, Stanford, USA. |
| Pseudocode | Yes | Algorithm 1: Knockoffs with Local Importance Scores Feature Selection Procedure (Section 3.2); Algorithm 2: One-Step Greedy Feature Space Partition (Section 3.3) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | We therefore sample the datasets X, X so that X simulates the SNPs of a cohort of patients, i.e. a matrix of 0, 1, and 2. (Section 4) The paper describes a synthetic data generation process but does not indicate that the dataset used for experiments is publicly available, nor does it provide access information (link, DOI, citation) for a specific public dataset. |
| Dataset Splits | No | We vary the size of the dataset to show that our method still controls local FDR even though the number of points in a given subregion is very limited. (Section 4) The paper mentions varying dataset size but does not specify train/validation/test splits, absolute sample counts for each split, or reference predefined splits with citations. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'logistic regression' as part of the importance score calculation, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, scikit-learn versions). |
| Experiment Setup | No | Our target FDR is q = 0.2 and locally we consider as importance scores the absolute values of the coefficients in a logistic regression. (Section 4) The paper provides some high-level experimental parameters, such as the target FDR level and the type of importance scores used, but it lacks specific hyperparameters (e.g., learning rate, batch size, number of epochs) and system-level training settings crucial for reproducibility. |
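For context on the selection step the table refers to: knockoff procedures of the kind described (Algorithm 1, target FDR q = 0.2) select features by comparing importance-score differences W_j between each feature and its knockoff against a data-dependent threshold. Below is a minimal sketch of the standard knockoff+ thresholding rule, not the paper's exact algorithm; the W values are hypothetical and for illustration only.

```python
def knockoff_threshold(W, q=0.2):
    """Knockoff+ threshold: smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    W: list of importance-score differences (e.g., |coef_j| - |coef_knockoff_j|
    from a logistic regression fit on the augmented feature matrix).
    q: target false discovery rate.
    """
    candidates = sorted(abs(w) for w in W if w != 0)
    for t in candidates:
        neg = sum(1 for w in W if w <= -t)   # knockoffs beating originals: FDP proxy
        pos = sum(1 for w in W if w >= t)    # candidate discoveries at level t
        if (1 + neg) / max(pos, 1) <= q:
            return t
    return float("inf")  # no threshold achieves the target FDR: select nothing

def select_features(W, q=0.2):
    """Indices of features whose statistic clears the knockoff+ threshold."""
    tau = knockoff_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= tau]

# Hypothetical statistics: large positive W_j suggests a genuinely relevant feature.
W = [4.1, 3.5, 2.8, 2.2, 1.9, 1.4, -0.3, 0.2, -0.1, 0.05]
print(select_features(W, q=0.2))  # → [0, 1, 2, 3, 4, 5]
```

The paper's contribution layers a local notion on top of this: importance scores (and hence the selections) are computed per region of a learned feature-space partition, so the same knockoff-style guarantee is targeted locally rather than globally.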