HyperFast: Instant Classification for Tabular Data
Authors: David Bonet, Daniel Mas Montserrat, Xavier Giró-i-Nieto, Alexander G. Ioannidis
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. |
| Researcher Affiliation | Collaboration | (1) Stanford University, Stanford, CA, USA; (2) Universitat Politècnica de Catalunya, Barcelona, Spain; (3) Amazon, Barcelona, Spain |
| Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 1) but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at https://github.com/AI-sandbox/HyperFast. (A hedged usage sketch follows the table.) |
| Open Datasets | Yes | We use the 70 tabular datasets from the OpenML-CC18 suite (Bischl et al. 2021) which, to the best of our knowledge, is the largest and most used standardized tabular dataset benchmark, composed of standard classification datasets (e.g., Breast Cancer, Bank Marketing). ... We also include tabular genomics datasets sourced from distinct biobanks. Specifically, we utilize genome sequences of dogs (Bartusiak et al. 2022) for dog clade (group of breeds) prediction in meta-training, European (British) humans from the UK Biobank (UKB) (Sudlow et al. 2015) for phenotype prediction in meta-validation, and HapMap3 (Consortium et al. 2010) for subpopulation prediction in the meta-test. |
| Dataset Splits | Yes | The collection of OpenML datasets is randomly shuffled and divided into meta-training, meta-validation and meta-testing sets, with a 75%-10%-15% split, respectively. (An illustrative split sketch follows the table.) |
| Hardware Specification | No | The paper mentions that "Time results are shown for a single GPU" in Table 1, and "GPU training is possible for the model", but it does not specify any particular GPU model (e.g., NVIDIA A100, RTX 2080 Ti), CPU model, or other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions various software components and libraries like "scikit-learn (Pedregosa et al. 2011)", XGBoost, LightGBM, CatBoost, Auto-Sklearn, AutoGluon, SAINT, TabPFN, NODE, FT-Transformer, and T2G-Former. However, it does not provide specific version numbers for these software dependencies, which are needed for exact reproducibility. |
| Experiment Setup | Yes | We set a Random Features projection to 32,768 (2^15) features, sampled from a normal distribution following the He initialization (He et al. 2015), followed by a ReLU activation. ... Then, we keep the principal components (PCs) associated with the 784 largest eigenvalues. ... After the PCA, values are clipped at 4σ. ... As a shared module we use 2 feed-forward layers with a hidden size of 1024 and ReLU activations. For the main network, we consider a 3-layer MLP with a residual connection (He et al. 2016), and a main network hidden size equal to the number of PCs (784 dimensions). ... A maximum batch size of 2048 samples is used for training... (A dimensions-only architecture sketch follows the table.) |
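The "Open Source Code" row points to a repository that advertises a scikit-learn-like interface. Below is a minimal usage sketch under that assumption; the import path, the class name `HyperFastClassifier`, and its arguments are guesses at what such an interface would look like, not verified against the repository.

```python
# Hypothetical usage sketch of a scikit-learn-like HyperFast interface.
# The import path, class name, and constructor arguments are assumptions;
# see https://github.com/AI-sandbox/HyperFast for the actual API.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from hyperfast import HyperFastClassifier  # assumed import path

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HyperFastClassifier()   # hypernetwork generates the main network at fit time
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```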
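For the "Open Datasets" and "Dataset Splits" rows, the sketch below shows one way to pull the OpenML-CC18 suite with the `openml` Python package and shuffle its datasets into the 75%/10%/15% meta-training/meta-validation/meta-test split described in the paper. Suite ID 99 is OpenML-CC18; the shuffling and split logic is an illustration, not the authors' code, and the seed is arbitrary.

```python
# Illustrative sketch: fetch the OpenML-CC18 suite and shuffle its datasets
# into meta-training / meta-validation / meta-test sets (75% / 10% / 15%),
# mirroring the split described in the paper. Not the authors' code.
import numpy as np
import openml

suite = openml.study.get_suite(99)        # suite 99 = OpenML-CC18
dataset_ids = np.array(suite.data)        # dataset IDs in the suite

rng = np.random.default_rng(seed=0)       # arbitrary seed for the shuffle
rng.shuffle(dataset_ids)

n = len(dataset_ids)
n_train = int(0.75 * n)
n_val = int(0.10 * n)

meta_train = dataset_ids[:n_train]
meta_val = dataset_ids[n_train:n_train + n_val]
meta_test = dataset_ids[n_train + n_val:]
print(len(meta_train), len(meta_val), len(meta_test))
```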
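The "Experiment Setup" row packs several architectural constants into one quote: a 32,768-dimensional random-features projection with He initialization and a ReLU, PCA down to 784 components with clipping at 4σ, and a 3-layer MLP main network with a residual connection and hidden size 784. The PyTorch sketch below only reproduces those stated shapes; in HyperFast the main network's weights are produced by the hypernetwork rather than trained per dataset, the PCA here is a fixed linear placeholder, and the exact placement of the residual connection and the clipping step is assumed.

```python
# Minimal sketch of the dimensions quoted in the experiment setup:
# random-features projection (32,768 dims, He init, ReLU), a stand-in for
# PCA down to 784 components with clipping at 4 sigma, and a 3-layer MLP
# with a residual connection and hidden size 784. Shapes only; this is not
# the HyperFast hypernetwork.
import torch
import torch.nn as nn

N_RF = 32_768   # random-features dimension (2**15)
N_PC = 784      # number of retained principal components

class Preprocess(nn.Module):
    """Random-features projection (He init, ReLU) followed by a PCA stand-in."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.rf = nn.Linear(in_dim, N_RF, bias=False)
        nn.init.kaiming_normal_(self.rf.weight)       # He initialization
        self.pca = nn.Linear(N_RF, N_PC, bias=False)  # placeholder for a fitted PCA
        for p in self.parameters():
            p.requires_grad_(False)                   # fixed transforms
    def forward(self, x):
        z = torch.relu(self.rf(x))
        z = self.pca(z)
        return torch.clamp(z, -4.0, 4.0)              # clip at 4 sigma (assumes standardized PCs)

class MainNetwork(nn.Module):
    """3-layer MLP with a residual connection; hidden size equals the number of PCs."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.fc1 = nn.Linear(N_PC, N_PC)
        self.fc2 = nn.Linear(N_PC, N_PC)
        self.out = nn.Linear(N_PC, n_classes)
    def forward(self, z):
        h = torch.relu(self.fc1(z))
        h = torch.relu(self.fc2(h)) + z               # residual placement assumed
        return self.out(h)

x = torch.randn(8, 100)                               # batch of 8 samples, 100 raw features
logits = MainNetwork(n_classes=3)(Preprocess(in_dim=100)(x))
print(logits.shape)                                   # torch.Size([8, 3])
```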