reproducibilityindex.ai

Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains

Authors: Kyungeun Lee, Ye Seul Sim, Hyeseung Cho, Moonjung Eo, Suhee Yoon, Sanghyu Yoon, Woohyung Lim

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive evaluations across diverse tabular datasets corroborate that our method consistently improves tabular representation learning performance for a wide range of downstream tasks. Our empirical investigations ascertain several advantages of binning: capturing the irregular function, compatibility with encoder architecture and additional modifications, standardizing all features into equal sets, grouping similar values within a feature, and providing ordering information.
Researcher Affiliation	Industry	LG AI Research, Seoul, Republic of Korea. Correspondence to: Woohyung Lim <w.lim@lgresearch.ai>.
Pseudocode	No	The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The codes are available in https://github.com/kyungeun-lee/tabularbinning.
Open Datasets	Yes	In this study, we use 25 public datasets mostly from the OpenML (Vanschoren et al., 2014) library, including the frequently used datasets in previous studies (Yoon et al., 2020; Ucar et al., 2021; Gorishniy et al., 2021; 2022). We summarize the main properties of datasets in Table 4.
Dataset Splits	Yes	For all datasets, we apply standardization for numerical features and labels for evaluating the regression tasks. Each dataset has exactly one train-validation-test split, so all algorithms use the same splits as the previous studies (Gorishniy et al., 2021; 2022; Rubachev et al., 2022).
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA GeForce RTX 3090.
Software Dependencies	No	The paper mentions "Optimizer: AdamW" but does not specify version numbers for programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or other key software libraries used for implementation.
Experiment Setup	Yes	For the hyperparameters related to SSL, we tried p_m ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and the number of bins T ∈ {2, 5, 10, 20, 50, 100}. Optimizer: AdamW, Learning rate: 1e-4, Weight decay: 1e-5, Epochs: 1000, Learning rate scheduler: Cosine annealing scheduler. We summarize the best setups for all datasets as follows. Table 9: Training setups for the best cases in Table 7.
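
To make the size of the search space in the Experiment Setup row concrete, the following is a minimal Python sketch, not the authors' code. It enumerates the quoted grid over p_m and T and evaluates a textbook cosine annealing schedule with the quoted learning rate and epoch count. The function names `ssl_grid` and `cosine_lr` are hypothetical, and the zero minimum learning rate and the absence of warm-up are assumptions the row does not confirm.

```python
import itertools
import math

# Search grid quoted in the Experiment Setup row.
MASK_PROBS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # p_m
NUM_BINS = [2, 5, 10, 20, 50, 100]                          # T

# Fixed training settings quoted in the same row.
BASE_LR = 1e-4
WEIGHT_DECAY = 1e-5
EPOCHS = 1000


def ssl_grid():
    """Return every (p_m, T) pair in the quoted search grid."""
    return list(itertools.product(MASK_PROBS, NUM_BINS))


def cosine_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, min_lr=0.0):
    """Textbook cosine annealing from base_lr down to min_lr.

    min_lr=0.0 and the lack of warm-up are assumptions, not details
    taken from the paper.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    grid = ssl_grid()
    print(len(grid))  # number of (p_m, T) combinations: 9 * 6
    for epoch in (0, EPOCHS // 2, EPOCHS):
        print(epoch, cosine_lr(epoch))
```

Under these assumptions the grid holds 54 (p_m, T) combinations per dataset, and the learning rate decays from 1e-4 at the first epoch toward zero at epoch 1000.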