HyperFast: Instant Classification for Tabular Data
Authors: David Bonet, Daniel Mas Montserrat, Xavier Giró-i-Nieto, Alexander G. Ioannidis
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. |
| Researcher Affiliation | Collaboration | (1) Stanford University, Stanford, CA, USA; (2) Universitat Politècnica de Catalunya, Barcelona, Spain; (3) Amazon, Barcelona, Spain |
| Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 1) but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at https://github.com/AI-sandbox/HyperFast. (A hedged usage sketch follows the table.) |
| Open Datasets | Yes | We use the 70 tabular datasets from the OpenML-CC18 suite (Bischl et al. 2021) which, to the best of our knowledge, is the largest and most used standardized tabular dataset benchmark, composed of standard classification datasets (e.g., Breast Cancer, Bank Marketing). ... We also include tabular genomics datasets sourced from distinct biobanks. Specifically, we utilize genome sequences of dogs (Bartusiak et al. 2022) for dog clade (group of breeds) prediction in meta-training, European (British) humans from the UK Biobank (UKB) (Sudlow et al. 2015) for phenotype prediction in meta-validation, and HapMap3 (Consortium et al. 2010) for subpopulation prediction in the meta-test. |
| Dataset Splits | Yes | The collection of OpenML datasets is randomly shuffled and divided into meta-training, meta-validation and meta-testing sets, with a 75%-10%-15% split, respectively. (An illustrative split sketch follows the table.) |
| Hardware Specification | No | The paper mentions that "Time results are shown for a single GPU" in Table 1, and "GPU training is possible for the model", but it does not specify any particular GPU model (e.g., NVIDIA A100, RTX 2080 Ti), CPU model, or other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions various software components and libraries like "scikit-learn (Pedregosa et al. 2011)", XGBoost, LightGBM, CatBoost, Auto-Sklearn, AutoGluon, SAINT, TabPFN, NODE, FT-Transformer, and T2G-Former. However, it does not provide specific version numbers for these software dependencies, which are needed for exact reproducibility. |
| Experiment Setup | Yes | We set a Random Features projection to 32,768 (2^15) features, sampled from a normal distribution following the He initialization (He et al. 2015), followed by a ReLU activation. ... Then, we keep the principal components (PCs) associated with the 784 largest eigenvalues. ... After the PCA, values are clipped at 4σ. ... As a shared module we use 2 feed-forward layers with a hidden size of 1024 and ReLU activations. For the main network, we consider a 3-layer MLP with a residual connection (He et al. 2016), and a main network hidden size equal to the number of PCs (784 dimensions). ... A maximum batch size of 2048 samples is used for training... (A dimensions-only architecture sketch follows the table.) |
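The "Open Source Code" row points to a repository that advertises a scikit-learn-like interface. Below is a minimal usage sketch under that assumption; the import path, the class name `HyperFastClassifier`, and its arguments are guesses at what such an interface would look like, not verified against the repository.

```python
# Hypothetical usage sketch of a scikit-learn-like HyperFast interface.
# The import path, class name, and constructor arguments are assumptions;
# see https://github.com/AI-sandbox/HyperFast for the actual API.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from hyperfast import HyperFastClassifier  # assumed import path

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HyperFastClassifier()   # hypernetwork generates the main network at fit time
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```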
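For the "Open Datasets" and "Dataset Splits" rows, the sketch below shows one way to pull the OpenML-CC18 suite with the `openml` Python package and shuffle its datasets into the 75%/10%/15% meta-training/meta-validation/meta-test split described in the paper. Suite ID 99 is OpenML-CC18; the shuffling and split logic is an illustration, not the authors' code, and the seed is arbitrary.

```python
# Illustrative sketch: fetch the OpenML-CC18 suite and shuffle its datasets
# into meta-training / meta-validation / meta-test sets (75% / 10% / 15%),
# mirroring the split described in the paper. Not the authors' code.
import numpy as np
import openml

suite = openml.study.get_suite(99)        # suite 99 = OpenML-CC18
dataset_ids = np.array(suite.data)        # dataset IDs in the suite

rng = np.random.default_rng(seed=0)       # arbitrary seed for the shuffle
rng.shuffle(dataset_ids)

n = len(dataset_ids)
n_train = int(0.75 * n)
n_val = int(0.10 * n)

meta_train = dataset_ids[:n_train]
meta_val = dataset_ids[n_train:n_train + n_val]
meta_test = dataset_ids[n_train + n_val:]
print(len(meta_train), len(meta_val), len(meta_test))
```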
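The "Experiment Setup" row packs several architectural constants into one quote: a 32,768-dimensional random-features projection with He initialization and a ReLU, PCA down to 784 components with clipping at 4σ, and a 3-layer MLP main network with a residual connection and hidden size 784. The PyTorch sketch below only reproduces those stated shapes; in HyperFast the main network's weights are produced by the hypernetwork rather than trained per dataset, the PCA here is a fixed linear placeholder, and the exact placement of the residual connection and the clipping step is assumed.

```python
# Minimal sketch of the dimensions quoted in the experiment setup:
# random-features projection (32,768 dims, He init, ReLU), a stand-in for
# PCA down to 784 components with clipping at 4 sigma, and a 3-layer MLP
# with a residual connection and hidden size 784. Shapes only; this is not
# the HyperFast hypernetwork.
import torch
import torch.nn as nn

N_RF = 32_768   # random-features dimension (2**15)
N_PC = 784      # number of retained principal components

class Preprocess(nn.Module):
    """Random-features projection (He init, ReLU) followed by a PCA stand-in."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.rf = nn.Linear(in_dim, N_RF, bias=False)
        nn.init.kaiming_normal_(self.rf.weight)       # He initialization
        self.pca = nn.Linear(N_RF, N_PC, bias=False)  # placeholder for a fitted PCA
        for p in self.parameters():
            p.requires_grad_(False)                   # fixed transforms
    def forward(self, x):
        z = torch.relu(self.rf(x))
        z = self.pca(z)
        return torch.clamp(z, -4.0, 4.0)              # clip at 4 sigma (assumes standardized PCs)

class MainNetwork(nn.Module):
    """3-layer MLP with a residual connection; hidden size equals the number of PCs."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.fc1 = nn.Linear(N_PC, N_PC)
        self.fc2 = nn.Linear(N_PC, N_PC)
        self.out = nn.Linear(N_PC, n_classes)
    def forward(self, z):
        h = torch.relu(self.fc1(z))
        h = torch.relu(self.fc2(h)) + z               # residual placement assumed
        return self.out(h)

x = torch.randn(8, 100)                               # batch of 8 samples, 100 raw features
logits = MainNetwork(n_classes=3)(Preprocess(in_dim=100)(x))
print(logits.shape)                                   # torch.Size([8, 3])
```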