Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains
Authors: Kyungeun Lee, Ye Seul Sim, Hyeseung Cho, Moonjung Eo, Suhee Yoon, Sanghyu Yoon, Woohyung Lim
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations across diverse tabular datasets corroborate that our method consistently improves tabular representation learning performance for a wide range of downstream tasks. Our empirical investigations ascertain several advantages of binning: capturing the irregular function, compatibility with encoder architecture and additional modifications, standardizing all features into equal sets, grouping similar values within a feature, and providing ordering information. |
| Researcher Affiliation | Industry | LG AI Research, Seoul, Republic of Korea. Correspondence to: Woohyung Lim <w.lim@lgresearch.ai>. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The codes are available in https://github.com/kyungeun-lee/tabularbinning. |
| Open Datasets | Yes | In this study, we use 25 public datasets mostly from the OpenML (Vanschoren et al., 2014) library, including the frequently used datasets in previous studies (Yoon et al., 2020; Ucar et al., 2021; Gorishniy et al., 2021; 2022). We summarize the main properties of datasets in Table 4. |
| Dataset Splits | Yes | For all datasets, we apply standardization for numerical features and labels for evaluating the regression tasks. Each dataset has exactly one train-validation-test split, so all algorithms use the same splits as the previous studies (Gorishniy et al., 2021; 2022; Rubachev et al., 2022). |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions "Optimizer: AdamW" but does not specify version numbers for programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or other key software libraries used for implementation. |
| Experiment Setup | Yes | For the hyperparameters related to SSL, we tried p_m ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and the number of bins T ∈ {2, 5, 10, 20, 50, 100}. Optimizer: AdamW, Learning rate: 1e-4, Weight decay: 1e-5, Epochs: 1000, Learning rate scheduler: Cosine annealing scheduler. We summarize the best setups for all datasets as follows. Table 9: Training setups for the best cases in Table 7. |
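The pretext task quoted above (standardizing every feature into an equal set of ordered bins and predicting bin membership) can be approximated by the sketch below. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): it assumes per-feature quantile binning into T bins, a plain MLP encoder, and an in-batch value-shuffling corruption governed by a mask probability p_m; the names `quantile_bin_indices`, `BinPretextModel`, and `pretext_loss` are hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

def quantile_bin_indices(X: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Map each numerical feature to bin indices 0..n_bins-1 using its own quantiles."""
    bin_ids = np.zeros_like(X, dtype=np.int64)
    for j in range(X.shape[1]):
        # Quantile edges turn every feature into an "equal set" of ordered bins.
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        bin_ids[:, j] = np.digitize(X[:, j], edges)
    return bin_ids

class BinPretextModel(nn.Module):
    """MLP encoder plus one n_bins-way classification head per feature."""
    def __init__(self, n_features: int, n_bins: int, d_hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(d_hidden, n_bins) for _ in range(n_features)]
        )

    def forward(self, x):
        z = self.encoder(x)
        return [head(z) for head in self.heads]  # per-feature bin logits

def pretext_loss(model, x, bin_targets, p_mask=0.2):
    """Corrupt a random subset of entries, then predict every feature's bin index."""
    mask = (torch.rand_like(x) < p_mask).float()
    # Illustrative corruption: replace masked entries with in-batch shuffled values.
    x_corrupt = x * (1 - mask) + x[torch.randperm(x.size(0))] * mask
    logits = model(x_corrupt)
    return sum(
        nn.functional.cross_entropy(logits_j, bin_targets[:, j])
        for j, logits_j in enumerate(logits)
    ) / len(logits)
```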
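The reported optimization setup (AdamW, learning rate 1e-4, weight decay 1e-5, 1000 epochs, cosine annealing) can be wired up as in the following sketch. The data loader contents, batch size, and device handling are assumptions, since the section does not specify them; the per-dataset best p_m and T from Table 9 would be passed here as `p_mask` and `n_bins`.

```python
import torch

def pretrain(model, loader, epochs=1000, device="cuda", p_mask=0.2):
    """Self-supervised pretraining loop matching the reported hyperparameters.
    `loader` is assumed to yield (features, bin_indices) pairs precomputed
    with quantile_bin_indices from the sketch above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    model.to(device)
    for _ in range(epochs):
        for x, bins in loader:
            x, bins = x.to(device), bins.to(device)
            loss = pretext_loss(model, x, bins, p_mask=p_mask)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()  # one cosine-annealing step per epoch
    return model
```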