Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Hybrid Autoencoders for Tabular Data: Leveraging Model-Based Augmentation in Low-Label Settings

Authors: Erel Naor, Ofir Lindenbaum

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our main contributions are as follows: (1) We introduce TANDEM, a hybrid self-supervised autoencoder that combines a neural encoder, an oblivious soft decision tree encoder, a shared decoder, and sample-specific stochastic gating networks, enabling the learning of complementary representations suited to tabular data. (2) We demonstrate that the representations learned by the neural encoder enable strong performance for both classification and regression tasks under low-label conditions, surpassing established deep learning and tree-based baselines. (3) We conduct extensive experiments across a diverse suite of tabular datasets and systematically vary the number of labeled samples (from 50 to 1000 per dataset), establishing the robustness of TANDEM in a range of low-label regimes. (4) We provide both qualitative spectral analysis and quantitative comparison of gating activations, revealing how the two encoders capture distinct and complementary inductive biases.
Researcher Affiliation	Academia	Erel Naor Bar-Ilan University EMAIL Ofir Lindenbaum Bar-Ilan University EMAIL
Pseudocode	No	The paper describes the model architecture and training objective in detail, including equations for the OSDT encoder and loss functions, but it does not present a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	If you ran experiments, did you include the code, data, and instructions needed to reproduce the main experimental results? [Yes] We include code to extract datasets and reproduce results as part of the supplementary material.
Open Datasets	Yes	Our experiments are conducted in the low-label regime where labeled budgets range from 50 to 1000 samples per dataset, enabling consistent evaluation under limited supervision. We used all relevant classification datasets from OpenML analyzed in two widely cited studies on the limitations of deep learning for tabular data [9, 18]. We retained only datasets with at least 2,500 samples per class. For each class, 2,000 samples were allocated for self-supervised pretraining, and up to 1,000 labeled samples in total (across classes) were reserved for downstream evaluation. This filtering resulted in 19 classification datasets. For regression, we extracted 13 datasets that satisfy the same filtering as in the classification benchmark from the same OpenML sources referenced above.
Dataset Splits	No	The paper states: 'For each class, 2,000 samples were allocated for self-supervised pretraining, and up to 1,000 labeled samples in total (across classes) were reserved for downstream evaluation.' and 'Early stopping was based on validation accuracy or MSE, as appropriate.' While it specifies the total number of samples for pretraining and labeled data, and implies the use of a validation set, it does not explicitly provide the specific percentages or counts for training, validation, and test splits for the downstream task from the 'up to 1,000 labeled samples'.
Hardware Specification	Yes	All timing measurements reported below were collected on an NVIDIA L4 GPU (cloud instance).
Software Dependencies	No	The paper mentions various models and libraries like XGBoost, CatBoost, PyTorch (implicitly for neural networks), and Optuna for hyperparameter tuning. However, it does not provide specific version numbers for these software dependencies (e.g., Python 3.x, PyTorch 1.x, Optuna x.y.z).
Experiment Setup	Yes	Pretraining was run for 100 epochs with a batch size of 128, and this configuration was held constant across all experiments to ensure fair and consistent comparison. We used RMSprop as the optimizer for both pretraining and fine-tuning across all models. The training objective combined reconstruction, output alignment, and latent similarity losses, as described in Section 3. Hyperparameters, including learning rate, encoder depth, and weight decay, were selected using Optuna over 50 trials based on the validation loss. Complete optimization, including computational details and parameter ranges, is provided in the appendix. For downstream evaluation, a single-layer MLP classifier or regressor was trained on the labeled subset using the neural encoder. The encoder was frozen for the first 25 epochs, then fine-tuned for an additional 25 epochs at a reduced learning rate. Early stopping was based on validation accuracy or MSE, as appropriate. Table G.1: Hyperparameter search space for all models. Each model was optimized using Optuna with 50 trials. Gating components (when applicable) were tuned separately. (Includes detailed ranges for learning rate, batch size, epochs, max depth, dropout, etc.)
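
The dataset filter and sample allocation quoted under Open Datasets can be illustrated with a short sketch. This is not the authors' code. The three thresholds come from the quoted text; the function name split_low_label, the seed argument, and the choice to draw the labeled subset uniformly at random from the samples left over after pretraining allocation are assumptions made for illustration, since the paper does not state how the downstream samples are selected or split.

```python
import random
from collections import defaultdict

MIN_PER_CLASS = 2500       # stated: keep datasets with >= 2,500 samples per class
PRETRAIN_PER_CLASS = 2000  # stated: 2,000 samples per class for pretraining
MAX_LABELED = 1000         # stated: up to 1,000 labeled samples in total


def split_low_label(labels, labeled_budget, seed=0):
    """Hypothetical helper: return (pretrain_idx, labeled_idx), or None if
    the dataset fails the per-class size filter."""
    if not 0 < labeled_budget <= MAX_LABELED:
        raise ValueError("labeled_budget must be in 1..1000")

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Dataset filter: every class needs at least MIN_PER_CLASS samples.
    if any(len(idx) < MIN_PER_CLASS for idx in by_class.values()):
        return None

    rng = random.Random(seed)
    pretrain, pool = [], []
    for y in sorted(by_class):
        idx = list(by_class[y])
        rng.shuffle(idx)
        pretrain.extend(idx[:PRETRAIN_PER_CLASS])
        pool.extend(idx[PRETRAIN_PER_CLASS:])

    # Assumption: the labeled subset is a uniform random draw from the
    # remaining pool; the paper does not specify the sampling scheme.
    labeled = rng.sample(pool, min(labeled_budget, len(pool)))
    return pretrain, labeled
```

Calling split_low_label(labels, 50) on a dataset that passes the filter yields 2,000 pretraining indices per class and 50 disjoint labeled indices. How those labeled samples are further divided into training, validation, and test sets is left open here, matching the Dataset Splits finding above.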