Variable Importance in High-Dimensional Settings Requires Grouping

Authors: Ahmad Chamma, Bertrand Thirion, Denis Engemann

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Each entry below gives a reproducibility variable, its result, and the LLM response supporting that result.
Research Type: Experimental
"We conduct extensive benchmarks on synthetic and real-world data (Section 4), which demonstrate the capacity of the proposed method to combine high prediction performance with theoretically grounded identification of predictively important groups of variables."
Researcher Affiliation: Collaboration
Ahmad Chamma (1, 2, 3), Bertrand Thirion (1, 2, 3)*, Denis Engemann (4)*. 1: Inria-Saclay, Palaiseau, France; 2: Université Paris-Saclay; 3: CEA Saclay; 4: Roche Pharma Research and Early Development, Neuroscience and Rare Diseases, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland.
Pseudocode: No
The paper provides block diagrams and mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code: Yes
"We provide publicly available code (compatible with the Scikit-learn API) on GitHub (https://github.com/achamma723/Group_Variable_Importance)."
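The exact public API of that repository is not quoted in this report, so the snippet below is only a minimal, hypothetical sketch of the underlying idea: scoring predefined groups of columns rather than single variables, using standard scikit-learn estimators. For simplicity it uses marginal (unconditional) group permutation, whereas the paper's method permutes groups conditionally on the remaining variables; the group definitions, data, and hyperparameters here are illustrative assumptions, not values from the paper or its codebase.

```python
# Hypothetical sketch: permutation-style importance over *groups* of columns.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, p = 500, 12
X = rng.normal(size=(n, p))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only group "A" is predictive

# Illustrative grouping of the 12 columns into four blocks (an assumption)
groups = {"A": [0, 1, 2], "B": [3, 4, 5], "C": [6, 7, 8], "D": [9, 10, 11]}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
baseline = r2_score(y_te, model.predict(X_te))

for name, cols in groups.items():
    drops = []
    for _ in range(20):  # average over several permutations to reduce variance
        X_perm = X_te.copy()
        idx = rng.permutation(len(X_te))
        # Permute all columns of the group jointly, preserving within-group dependence
        X_perm[:, cols] = X_te[np.ix_(idx, cols)]
        drops.append(baseline - r2_score(y_te, model.predict(X_perm)))
    print(f"group {name}: mean loss in R^2 = {np.mean(drops):.3f}")
```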
Open Datasets: Yes
"The UK Biobank project (UKBB) encompasses imaging and socio-demographic derived phenotypes from a prospective cohort of participants drawn from the population of the UK (Constantinescu et al. 2022; Littlejohns et al. 2020). We accessed the UKBB data through its controlled access scheme in accordance with its institutional ethics boards (Bycroft et al. 2018; Sudlow et al. 2015)."
Dataset Splits: Yes
"Across the paper, we rely on an i.i.d. sampling train/validation/test partition scheme where the n samples are divided into n_train training and n_test test samples. [...] The default behavior consists of a 2-fold internal cross-validation where the importance inference is performed on an unseen test set. [...] We used 10-fold cross-validation with significance estimation and refitting the reduced model using the training set while scoring with the reduced model on the test set."
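As a reading aid, here is a minimal sketch of the refitting scheme the quote describes: in each of 10 folds, a "reduced" model with one group of columns removed is refit on the training split and compared against the full model on the held-out split. The data, group definition, and model choice (RandomForestRegressor) are placeholders, not values from the paper.

```python
# Sketch of 10-fold CV with a refit reduced model (assumed data and group).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400)

group = [0, 1]                                     # group under test (assumed)
rest = np.setdiff1d(np.arange(X.shape[1]), group)  # columns kept in the reduced model

drops = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    full = RandomForestRegressor(random_state=0).fit(X[train], y[train])
    reduced = RandomForestRegressor(random_state=0).fit(X[train][:, rest], y[train])
    drops.append(
        r2_score(y[test], full.predict(X[test]))
        - r2_score(y[test], reduced.predict(X[test][:, rest]))
    )
# A one-sample test over the fold-wise drops (e.g., scipy.stats.ttest_1samp)
# could serve as the "significance estimation" the quote mentions.
print(f"mean R^2 drop after removing the group: {np.mean(drops):.3f}")
```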
Hardware Specification: No
The paper discusses computation time and mentions running on '100 cores', but it does not provide specific hardware details such as GPU or CPU models, processor types, or memory specifications used for the experiments.
Software Dependencies: No
The paper mentions compatibility with the 'Scikit-learn API' and cites scikit-learn (Pedregosa et al. 2011), but it does not provide specific version numbers for scikit-learn or any other software dependencies, such as Python, PyTorch, or TensorFlow.
Experiment Setup: No
The paper mentions using a 2-fold internal cross-validation and performing '100 runs' per experiment, and it describes the models used (a DNN and a Random Forest) as well as the data-generating process for the synthetic benchmarks. However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings for the DNN model, nor does it include a dedicated table or paragraph detailing the full experimental setup.
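Since the hyperparameters are reportedly unspecified, the sketch below only illustrates the repeated-runs protocol described above: in each run a fresh synthetic dataset is drawn, and a 2-fold internal cross-validation fits the model on one half and infers group importance on the unseen half before swapping folds. The data-generating process, group definition, and all hyperparameters are assumptions for illustration; the run count is reduced from the paper's 100 for speed.

```python
# Sketch of the "100 runs with 2-fold internal cross-validation" protocol.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def one_run(seed, n=300, p=8):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)  # assumed synthetic draw
    imps = []
    # 2-fold internal CV: fit on one half, infer importance on the unseen half
    for fit_idx, infer_idx in KFold(n_splits=2, shuffle=True, random_state=seed).split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X[fit_idx], y[fit_idx])
        base = r2_score(y[infer_idx], model.predict(X[infer_idx]))
        X_perm = X[infer_idx].copy()
        # Jointly permute the tested group (columns 0 and 1) on the unseen half
        X_perm[:, [0, 1]] = rng.permutation(X_perm[:, [0, 1]], axis=0)
        imps.append(base - r2_score(y[infer_idx], model.predict(X_perm)))
    return np.mean(imps)

scores = [one_run(s) for s in range(10)]  # the paper mentions 100 runs
print(f"group importance across runs: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```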