Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LLM Meeting Decision Trees on Tabular Data

Authors: Hangting Ye, Jinmeng Li, He Zhao, Dandan Guo, Yi Chang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance. The source code is available at https://github.com/HangtingYe/DeLTa. ... We conduct extensive experiments on various tabular benchmarks and competing benchmark algorithms, and comprehensive results along with analysis and visualizations demonstrate our effectiveness. ... 5 Experiments
Researcher Affiliation	Academia	School of Artificial Intelligence, Jilin University (1); CSIRO's Data61 (2); Monash University (3); International Center of Future Science, Jilin University (4); Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China (5)
Pseudocode	Yes	The complete training and inference pipeline is summarized in Algorithm 1 of Appendix A.3.
Open Source Code	Yes	The source code is available at https://github.com/HangtingYe/DeLTa.
Open Datasets	Yes	Specifically, the tabular datasets include: Blood (BL) [56], Credit (CR) [57], Car [58], Bank (BA) [59], Adult (AD) [60], Jannis (JA) [11], Cpu_act (CP) [11], Credit_reg (CRR) [11], California_housing (CA) [61], House_16H (HO) [11], Fried (FR) [62], Diomand (DI) [63].
Dataset Splits	No	For data splits, $D_{\text{train}}$ denotes training set for model training, $D_{\text{val}}$ validation set for early stopping and hyperparameter tuning, and $D_{\text{test}}$ test set for final evaluation. ... Following prior work [24], we simulate the few-shot setting by randomly sampling a fixed number of training examples, where the number of shots denotes the total number of selected samples. The paper refers to external benchmarks and sampling strategies but does not provide specific percentages or counts for the train/validation/test splits within the paper itself.
Hardware Specification	Yes	The experiments are run on NVIDIA A100-PCIE-40GB GPU.
Software Dependencies	No	For LLM usage, DeLTa adopts GPT-4o as its LLM backbone... we use the API provided by OpenAI to perform black-box GPT-3.5 fine-tuning for LIFT... TP-BERTa employs RoBERTa [66]; GTL uses the 13B version of LLaMA 2 [67]. The paper mentions software components but does not provide specific version numbers for all key dependencies, such as 'scikit-learn' or the specific API client versions for the LLMs.
Experiment Setup	No	Implementation details of DeLTa including hyper-parameters are provided in Appendix A.3. ... Here, $L$ is a smooth loss function (e.g., mean squared error for regression or cross-entropy for classification). And $\nabla_{F(x)} L(F(x), y) \in \mathbb{R}$ for regression, $\nabla_{F(x)} L(F(x), y) \in \mathbb{R}$ for binary classification, and $\nabla_{F(x)} L(F(x), y) \in \mathbb{R}^c$ for multiclass classification, where $c$ is the number of classes. The Gradient Net $\phi$ contains learnable leaf node-specific mapping function $\phi_l$, where $\phi_l$ is implemented by CART for classification and TabPFN for regression, and $\phi_l$ over different leaf nodes are trained separately. While the paper states hyperparameters are in Appendix A.3, it only mentions the number of LLM queries (10 times by default) and the choice of loss function. It lacks specific hyperparameter values like learning rates, batch sizes, or epochs for the various models used (Random Forest, CART, TabPFN).
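
The Dataset Splits row quotes a few-shot protocol in which a fixed number of training examples is sampled at random, with the number of shots being the total number of selected samples. The following is a minimal sketch of that sampling step only, assuming a seeded uniform draw without replacement. The function name `sample_few_shot`, the seed, and the toy data are hypothetical and are not taken from the paper or its repository.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_few_shot(train: Sequence[T], num_shots: int, seed: int = 0) -> List[T]:
    """Draw num_shots training examples uniformly at random, without replacement.

    num_shots is the total number of selected samples, not a per-class count.
    """
    if not 0 < num_shots <= len(train):
        raise ValueError("num_shots must be between 1 and len(train)")
    rng = random.Random(seed)  # local generator, so the draw is repeatable
    return rng.sample(list(train), num_shots)


# Toy (feature, label) pairs standing in for a real training set (assumption).
toy_train = [(i, i % 2) for i in range(100)]
few_shot = sample_few_shot(toy_train, num_shots=8, seed=0)
print(len(few_shot))
```

Reporting the seed and the shot count alongside results is one way to make such a split recoverable; the exact procedure used by the authors may differ.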