Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

SHGR: A Generalized Maximal Correlation Coefficient

Authors: Samuel Stocksieker, Denys Pommeret

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive numerical experiments and feature selection tasks confirm that SHGR outperforms existing state-of-the-art methods. We validate SHGR on synthetic and real-world tabular datasets.
Researcher Affiliation | Academia | Samuel Stocksieker: CNRS, I2M, Aix Marseille University, Marseille, France. Denys Pommeret: CNRS, I2M, Aix Marseille University, Marseille, France.
Pseudocode | Yes | E SHGR Algorithm: In addition to Figures 1a and 1b, Figure 9 shows an example of architecture on a set of 3 variables. Based on neural networks, the SHGR algorithm is defined in two stages: one for building the architecture and the other for training the model. At each epoch, the correlation is measured on the inputs (in whole), and the algorithm retains the model with the lowest loss (i.e. the highest correlation) on the inputs. If the results no longer improve (to within an epsilon) during a given number of iterations, then learning stops. Algorithm 1: train_SHGR: Training of the SHGR model
Open Source Code | Yes | Code and data are available at: https://github.com/sstocksieker/SHGR
Open Datasets | Yes | We validate SHGR on synthetic and real-world tabular datasets. In Section G: Real-World Applications, the paper lists specific datasets used, such as Abalone, Air Quality, Appliance, Boston, Concrete, etc., with URLs and citations to public repositories like the UCI Machine Learning Repository.
Dataset Splits | Yes | For each method, we select the top k features most correlated with the target y, and assess predictive performance via RMSE on a test set (30%), using a random forest regressor. ... Predictions are made on a test set (30% of the original dataset), randomly sampled, using a random forest model. The training set consists of at most 500 observations for all datasets (to assess the robustness of the methods to sampling).
Hardware Specification | Yes | The computations were performed on a personal desktop computer with the following specifications: NVIDIA GeForce RTX 4080 graphics card, 64 GB of memory (but the memory usage did not exceed 30 GB), Intel i9-14900KF processor.
Software Dependencies | No | The paper lists various software libraries used (e.g., 'python library numpy', 'R library Alter Corr', 'python package maxcorr', 'python library HSIC') but does not provide specific version numbers for these components. For example, it mentions 'numpy' but not 'numpy 1.23.0'.
Experiment Setup | Yes | F.2.1 Hyperparameter: For the illustration, we have chosen the following hyperparameters: epoch number: 200; maximum batch size: 64; learning rate: 10e-3; hidden layer dimensions: [64, 32, 16, 8]; epsilon for early stopping: 0.5; iteration max for patience early stopping: 20; penalization for differentiable ranks (as defined in [4]): 1; α power parameter in SHGR loss function: 2.0
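The early-stopping rule quoted in the Pseudocode and Experiment Setup rows (keep the model with the lowest loss; stop when the loss no longer improves by more than an epsilon for a given number of iterations) could be sketched roughly as follows. This is a minimal illustration, not the authors' train_SHGR algorithm: the names EarlyStoppingConfig, train_with_early_stopping, and run_epoch are hypothetical, the exact comparison used in the paper may differ, and only the default values (200 epochs, epsilon 0.5, patience 20) come from the reported hyperparameters.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class EarlyStoppingConfig:
    # Defaults mirror the hyperparameters quoted from Section F.2.1.
    max_epochs: int = 200   # "epoch number: 200"
    epsilon: float = 0.5    # "epsilon for early stopping: 0.5"
    patience: int = 20      # "iteration max for patience early stopping: 20"


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    cfg: EarlyStoppingConfig,
) -> Tuple[int, float]:
    """Run epochs until the loss stops improving by more than epsilon.

    run_epoch is a hypothetical callback that trains for one epoch and
    returns the loss measured on the inputs. Returns the epoch index and
    loss of the best model seen (where a real implementation would keep
    a checkpoint of the weights).
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0  # consecutive epochs without an improvement larger than epsilon

    for epoch in range(cfg.max_epochs):
        loss = run_epoch(epoch)
        # Assumption: "improvement" means beating the best loss by more than epsilon.
        improved_enough = loss < best_loss - cfg.epsilon
        if loss < best_loss:
            # Retain the model with the lowest loss so far.
            best_loss, best_epoch = loss, epoch
        stale = 0 if improved_enough else stale + 1
        if stale >= cfg.patience:
            break
    return best_epoch, best_loss


# Toy usage with a simulated loss curve that plateaus after epoch 2.
simulated_losses = [5.0, 4.0, 3.0] + [3.0] * 50
toy_cfg = EarlyStoppingConfig(max_epochs=50, epsilon=0.01, patience=3)
best_epoch, best_loss = train_with_early_stopping(
    lambda e: simulated_losses[e], toy_cfg
)
print(best_epoch, best_loss)  # 2 3.0
```

In the toy run, training stops three epochs after the plateau begins, and the reported best model is the one from epoch 2.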
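The evaluation protocol quoted in the Dataset Splits row (a random 30% test set, a training set capped at 500 observations, top-k feature selection by correlation score, RMSE as the metric) can be illustrated with the standard-library sketch below. The helper names split_indices, top_k_features, and rmse are hypothetical. Whether the 500-observation cap is applied before or after the split is not stated in the excerpt, so applying it to the 70% remainder is an assumption. The random forest regressor itself is left out here.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_indices(
    n: int, test_frac: float = 0.30, max_train: int = 500, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split row indices into train and test sets.

    Assumption: the test set is 30% of the full dataset and the training
    set is the remainder, truncated to at most max_train rows.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(test_frac * n)
    test = idx[:n_test]
    train = idx[n_test:][:max_train]
    return train, test


def top_k_features(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k features with the highest score.

    scores[j] stands for any dependence measure between feature j and the
    target y (SHGR or a baseline); computing it is outside this sketch.
    """
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy usage: 2000 rows give a 600-row test set and a capped 500-row training set.
train_idx, test_idx = split_indices(2000)
selected = top_k_features([0.2, 0.9, 0.5], k=2)
print(len(train_idx), len(test_idx), selected)  # 500 600 [1, 2]
```

A regressor would then be fit on the selected columns of the training rows and scored with rmse on the test rows, once per correlation measure being compared.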