CARTE: Pretraining and Transfer for Tabular Learning
Authors: Myung Jun Kim, Léo Grinsztajn, Gaël Varoquaux
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. [...] Section 4 provides an extensive empirical study across many tabular datasets, benchmarking the settings of a single downstream table as well as multiple related ones. |
| Researcher Affiliation | Collaboration | ¹SODA Team, Inria Saclay, France; ²Probabl.ai, France. Correspondence to: Myung Jun Kim <myung.kim@inria.fr>. |
| Pseudocode | No | The paper describes the CARTE model architecture and training procedures, but it does not include any pseudocode or algorithm blocks that are explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The implementation and datasets will be available at: Implementation: https://github.com/soda-inria/carte. |
| Open Datasets | Yes | The implementation and datasets will be available at: [...] Datasets: https://huggingface.co/datasets/inria-soda/carte-benchmark. [...] We use 51 tabular learning datasets, all with an associated learning task (40 regressions and 11 classifications), gathered across multiple sources, mainly from previous machine learning studies and Kaggle competitions. [...] Appendix B gives the specific list of datasets. (A hedged download sketch follows the table.) |
| Dataset Splits | Yes | To find the optimal hyperparameters of the baselines, 5-fold cross-validation over 100 random-search iterations was carried out on all compared methods except CARTE and TabPFN. [...] We use early stopping on a validation set of the target table, and still rely on the bagging strategy of building multiple learners on different random validation sets and averaging the predictions. (A hedged sketch of the search protocol follows the table.) |
| Hardware Specification | Yes | A.3. Hardware Specifications: The pretrained model for CARTE was trained on GPUs; the rest of the experiments were run on 32 CPU cores, with hardware chosen based on availability. GPUs: NVIDIA V100 (32 GB VRAM). CPUs: AMD EPYC 7742 64-Core Processor, AMD EPYC 7702 64-Core Processor (512 GB RAM), Intel(R) Xeon(R) CPU E5-2660 v2, Intel(R) Xeon(R) Gold 6226R CPU (256 GB RAM). |
| Software Dependencies | No | The paper mentions several software components like the 'AdamW optimizer', 'FastText embeddings', 'intfloat/e5-small-v2 from Hugging Face', 'CatBoost', 'XGBoost', and 'scikit-learn'. However, it does not provide specific version numbers for these components, which is required for a reproducible description. |
| Experiment Setup | Yes | A.1. Pretrained Model of CARTE: We set 12 attention layers, each with 12 attention heads, and the hidden dimension was fixed to the same size as the inputs (300). The resulting model contains over 9.3 million parameters. To run the pretraining, we selected 128 entities with one additional positive each, resulting in a batch size of 256. The total number of training steps was 1,000,000, which covers approximately 40 epochs with respect to YAGO entities. We use the AdamW optimizer with a cosine scheduler (lr_min = 5 × 10⁻⁶, lr_max = 1 × 10⁻⁴) and a warmup over the first 10,000 steps. The dropout rate was fixed to 0.1 and the GELU activation function was used. [...] Table 4: Hyperparameter space for CARTE and baseline estimators. (A hedged optimizer/scheduler sketch follows the table.) |
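The Open Datasets row points to a Hugging Face dataset repository. Below is a minimal sketch of pulling the benchmark files locally with `huggingface_hub`; it assumes only that the repository id quoted above is publicly accessible, and makes no claim about the internal file layout (consult the repository card for the authoritative structure).

```python
# Sketch: download the CARTE benchmark files from Hugging Face.
# Only the repository id comes from the paper; the internal layout is
# not assumed here, so the whole repository is snapshotted.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="inria-soda/carte-benchmark",
    repo_type="dataset",
)
print(f"Benchmark files downloaded to: {local_dir}")
```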
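The Dataset Splits row quotes 5-fold cross-validation over 100 random-search iterations for the baselines. The sketch below reproduces that protocol shape with scikit-learn's `RandomizedSearchCV`; the estimator, search space, and data are illustrative stand-ins, not the paper's Table 4 configuration.

```python
# Sketch: 5-fold CV over 100 random-search iterations, as described for
# the baselines. Estimator, search space, and data are placeholders.
import numpy as np
from scipy.stats import loguniform
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X, y = rng.randn(500, 10), rng.randn(500)  # placeholder regression data

search = RandomizedSearchCV(
    HistGradientBoostingRegressor(random_state=0),
    param_distributions={
        "learning_rate": loguniform(1e-3, 1e0),
        "max_leaf_nodes": [15, 31, 63, 127],
        "min_samples_leaf": [5, 10, 20, 50],
    },
    n_iter=100,   # 100 random-search iterations
    cv=5,         # 5-fold cross-validation
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```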
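The Experiment Setup row reports AdamW with a cosine schedule (lr_max = 1 × 10⁻⁴ down to lr_min = 5 × 10⁻⁶), a 10,000-step warmup, and 1,000,000 total steps. The PyTorch sketch below uses those numbers; the model is a placeholder, and the exact way the authors combine warmup with the cosine scheduler (here: linear warmup, then cosine decay to lr_min) is an assumption.

```python
# Sketch: AdamW + linear warmup followed by cosine decay, using the
# reported lr_max = 1e-4, lr_min = 5e-6, 10,000 warmup steps and
# 1,000,000 total steps. The warmup/cosine combination is an assumption;
# the model is a placeholder for the CARTE encoder.
import math
import torch

lr_max, lr_min = 1e-4, 5e-6
warmup_steps, total_steps = 10_000, 1_000_000

model = torch.nn.Linear(300, 300)  # placeholder module
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_max)

def lr_lambda(step: int) -> float:
    """Multiplicative factor applied to lr_max at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)                # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # decays 1 -> 0
    return (lr_min + (lr_max - lr_min) * cosine) / lr_max

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Training-loop skeleton: step the optimizer, then the scheduler, per batch.
# for batch in loader:
#     loss = ...; loss.backward(); optimizer.step()
#     scheduler.step(); optimizer.zero_grad()
```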