Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains

Authors: Kyungeun Lee, Ye Seul Sim, Hyeseung Cho, Moonjung Eo, Suhee Yoon, Sanghyu Yoon, Woohyung Lim

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations across diverse tabular datasets corroborate that our method consistently improves tabular representation learning performance for a wide range of downstream tasks. Our empirical investigations ascertain several advantages of binning: capturing the irregular function, compatibility with encoder architecture and additional modifications, standardizing all features into equal sets, grouping similar values within a feature, and providing ordering information. (The binning idea is illustrated in a sketch after the table.)
Researcher Affiliation | Industry | LG AI Research, Seoul, Republic of Korea. Correspondence to: Woohyung Lim <w.lim@lgresearch.ai>.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The codes are available in https://github.com/kyungeun-lee/tabularbinning.
Open Datasets | Yes | In this study, we use 25 public datasets mostly from the OpenML (Vanschoren et al., 2014) library, including the frequently used datasets in previous studies (Yoon et al., 2020; Ucar et al., 2021; Gorishniy et al., 2021; 2022). We summarize the main properties of datasets in Table 4.
Dataset Splits | Yes | For all datasets, we apply standardization for numerical features and labels for evaluating the regression tasks. Each dataset has exactly one train-validation-test split, so all algorithms use the same splits as the previous studies (Gorishniy et al., 2021; 2022; Rubachev et al., 2022). (A preprocessing sketch covering dataset loading and standardization follows the table.)
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA GeForce RTX 3090.
Software Dependencies | No | The paper mentions "Optimizer: AdamW" but does not specify version numbers for programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or other key software libraries used for implementation.
Experiment Setup | Yes | For the hyperparameters related to SSL, we tried p_m ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and the number of bins T ∈ {2, 5, 10, 20, 50, 100}. Optimizer: AdamW, Learning rate: 1e-4, Weight decay: 1e-5, Epochs: 1000, Learning rate scheduler: Cosine annealing scheduler. We summarize the best setups for all datasets as follows. Table 9: Training setups for the best cases in Table 7. (A training-configuration sketch follows the table.)
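
The binning pretext task quoted in the Research Type row discretizes each numerical feature into ordered bins that the encoder learns to predict. The sketch below is only a rough illustration of that idea, not the authors' implementation: the quantile (equal-population) binning and the bin count of 10 are assumptions made for the example.

```python
import numpy as np

def quantile_bin_indices(x: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Map one numerical feature to ordered bin indices in {0, ..., n_bins - 1}.

    Quantile (equal-population) edges group similar values within a feature and
    give every feature the same discrete target space, matching the properties
    quoted above (illustrative sketch only, not the authors' code).
    """
    # Interior quantiles become the bin edges; np.digitize then assigns each
    # value the index of the bin it falls into.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(x, edges)

# Toy usage: every column of X yields its own bin-index targets, which a
# self-supervised encoder could be trained to predict from its inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
targets = np.stack(
    [quantile_bin_indices(X[:, j], n_bins=10) for j in range(X.shape[1])], axis=1
)
print(targets.shape)  # (1000, 4), integer targets in {0, ..., 9}
```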
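For the Open Datasets and Dataset Splits rows, a minimal preprocessing sketch is given below. The dataset name, the 70/15/15 split sizes, and the use of scikit-learn are assumptions made for illustration; the paper uses 25 datasets with fixed splits listed in its Table 4, which are not reproduced here.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical OpenML regression dataset with numerical features; substitute
# any of the paper's 25 datasets (see its Table 4).
X, y = fetch_openml("cpu_act", version=1, return_X_y=True, as_frame=False)
y = y.astype(np.float64)

# A single train/validation/test split (the sizes here are assumptions; the
# paper reuses the fixed splits of previous studies).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Standardize numerical features, and labels for the regression task, using
# statistics computed on the training split only.
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
X_train, X_val, X_test = (x_scaler.transform(s) for s in (X_train, X_val, X_test))
y_train, y_val, y_test = (
    y_scaler.transform(s.reshape(-1, 1)).ravel() for s in (y_train, y_val, y_test)
)
```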
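The Experiment Setup row lists the optimizer and schedule directly. A minimal PyTorch sketch of that configuration follows; the two-layer MLP encoder, the random input batch, and the stand-in loss are placeholders, and setting the cosine schedule's T_max to the number of epochs is an assumption.

```python
import torch
from torch import nn

# Placeholder encoder standing in for whichever backbone is used.
encoder = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 256))

epochs = 1000  # as reported in the Experiment Setup row

# AdamW with learning rate 1e-4 and weight decay 1e-5, plus a cosine
# annealing learning-rate scheduler, matching the listed setup.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # Real training would iterate over mini-batches of the binning pretext
    # task here; this single dummy step only exercises the optimizer/schedule.
    x = torch.randn(32, 16)
    loss = encoder(x).pow(2).mean()  # stand-in loss, not the binning objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the cosine schedule once per epoch
```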