Variable Importance in High-Dimensional Settings Requires Grouping
Authors: Ahmad Chamma, Bertrand Thirion, Denis Engemann
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive benchmarks on synthetic and real-world data (section 4) which demonstrate the capacity of the proposed method to combine high prediction performance with theoretically grounded identification of predictively important groups of variables. |
| Researcher Affiliation | Collaboration | Ahmad Chamma (1, 2, 3), Bertrand Thirion (1, 2, 3)*, Denis Engemann (4)* — (1) Inria-Saclay, Palaiseau, France; (2) Université Paris-Saclay; (3) CEA Saclay; (4) Roche Pharma Research and Early Development, Neuroscience and Rare Diseases, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland |
| Pseudocode | No | The paper provides block diagrams and mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | We provide publicly available code (compatible with the Scikit-learn API) on GitHub (https://github.com/achamma723/Group_Variable_Importance). |
| Open Datasets | Yes | The UK Biobank project (UKBB) encompasses imaging and socio-demographic derived phenotypes from a prospective cohort of participants drawn from the population of the UK (Constantinescu et al. 2022; Littlejohns et al. 2020). We accessed the UKBB data through its controlled access scheme in accordance with its institutional ethics boards (Bycroft et al. 2018; Sudlow et al. 2015). |
| Dataset Splits | Yes | Across the paper, we rely on an i.i.d. sampling train/validation/test partition scheme where the n samples are divided into n_train training and n_test test samples. [...] The default behavior consists of a 2-fold internal cross-validation where the importance inference is performed on an unseen test set. [...] We used 10-fold cross-validation with significance estimation and refitting the reduced model using the training set while scoring with the reduced model on the testing set. |
| Hardware Specification | No | The paper discusses computation time and mentions running on '100 cores' but does not provide specific hardware details such as GPU or CPU models, processor types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions compatibility with the 'Scikit-learn API' and cites 'scikit-learn (Pedregosa et al. 2011)', but it does not provide specific version numbers for scikit-learn or any other software dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | No | The paper mentions using a 2-fold internal cross-validation and performing '100 runs' for experiments, and describes the models used (DNN and Random Forest) as well as the data generation process for synthetic data. However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings for the DNN model, nor does it include a dedicated table or paragraph detailing the full experimental setup. |
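The split-and-refit scheme quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy data, the group definitions, and the single train/test split (rather than the paper's 10-fold cross-validation with significance estimation) are all assumptions made here for brevity. The idea is to score a model fit on all variables, then refit a "reduced" model with one group of variables removed and measure the drop in test score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.randn(300, 6)
# Toy outcome: only the first two variables are predictive (an assumption
# of this sketch, not taken from the paper's data-generating process).
y = X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(300)

# Hypothetical grouping: features 0-2 form group A, features 3-5 group B.
groups = {"A": [0, 1, 2], "B": [3, 4, 5]}

# i.i.d. train/test partition (n_train / n_test samples), as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Full model fit on the training set, scored on the unseen test set.
full = RandomForestRegressor(random_state=0).fit(X_train, y_train)
full_score = r2_score(y_test, full.predict(X_test))

importance = {}
for name, cols in groups.items():
    keep = [j for j in range(X.shape[1]) if j not in cols]
    # Refit the reduced model (group removed) on the training set,
    # then score it on the test set; the score drop is the group's importance.
    reduced = RandomForestRegressor(random_state=0).fit(X_train[:, keep], y_train)
    importance[name] = full_score - r2_score(y_test,
                                             reduced.predict(X_test[:, keep]))
    print(f"group {name}: test-score drop = {importance[name]:.3f}")
```

In this toy setup the informative group (A) should show a large score drop while the noise-only group (B) stays near zero, which is the qualitative behavior the refit-based group importance targets.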