Language Models are Realistic Tabular Data Generators
Authors: Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. |
| Researcher Affiliation | Academia | 1 University of Tübingen, Tübingen, Germany; 2 Technical University of Munich, Munich, Germany |
| Pseudocode | No | The paper includes figures illustrating the data pipeline and sampling procedure (Fig. 2 and Fig. 3) but does not contain any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Finally, we provide an easy-to-use Python implementation of the GReaT model, where it takes only three lines of code to generate new synthetic samples. Access to the package is provided via pip install be-great (https://github.com/kathrinse/be_great). A minimal usage sketch appears after this table. |
| Open Datasets | Yes | Key characteristics of each data set are presented in Table 7. We split all data sets into 80% train and 20% test sets to avoid any data leakage. All models are trained or fine-tuned on the same training data samples. Adult Income https://archive.ics.uci.edu/ml/datasets/Adult/ (from Table 9). |
| Dataset Splits | No | We split all data sets into 80% train and 20% test sets to avoid any data leakage. |
| Hardware Specification | Yes | Our hardware setup consisted of two NVIDIA 2080RTX GPUs with 12 GB RAM each, 126 GB system RAM, and an AMD Ryzen 3960X with 24 cores; we use the Ubuntu 20.04 operating system. |
| Software Dependencies | No | We utilize pretrained generative language models from the established Hugging Face framework (Wolf et al., 2020). For the ML efficiency and discriminator experiments... we additionally use linear/logistic regression, decision tree, and random forest models from the Scikit-Learn package (Buitinck et al., 2013). (No specific version numbers are provided for these software components.) An evaluation sketch appears after this table. |
| Experiment Setup | Yes | We fine-tune the Distill-GReaT model on each data set for 200 epochs, except for the California Housing and Diabetes data sets, which we fine-tune for 100 epochs. The GReaT baseline is fine-tuned for 110, 310, 400, 255, 150, and 85 epochs for the California Housing, Adult Income, Travel, Home Equity Line of Credit (HELOC), Sick, and Diabetes data sets, respectively. Depending on GPU memory limitations, we vary the batch size from 8 to 124. For the sampling step, we set the temperature parameter T to 0.7 for all experiments and data sets. We utilize the AdamW optimizer... with a learning rate of 5e-5. A configuration sketch appears after this table. |
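
A minimal sketch of the advertised three-line workflow, assuming the `be_great` API as documented in the linked repository (the constructor arguments `llm`, `batch_size`, and `epochs` and the `fit`/`sample` methods should be checked against the installed package version); the California Housing data set stands in here for any tabular DataFrame:

```python
# Sketch of the be_great workflow; argument names may differ across package versions.
from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame  # a public tabular data set

model = GReaT(llm="distilgpt2", batch_size=32, epochs=50)  # Distill-GReaT-style backbone
model.fit(data)                                            # fine-tune on the real table
synthetic_data = model.sample(n_samples=100)               # generate synthetic rows
```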
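
The ML efficiency protocol mentioned under Software Dependencies (train a downstream model on synthetic rows, evaluate on the held-out real test split) can be sketched with the scikit-learn estimators the paper names. The function below is illustrative rather than the authors' code; `real`, `synthetic`, and `target` are hypothetical placeholders:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def ml_efficiency(real: pd.DataFrame, synthetic: pd.DataFrame, target: str) -> float:
    """Train on synthetic data, score on the 20% real test split (illustrative only)."""
    # 80/20 split of the real data, mirroring the paper's setup.
    _, real_test = train_test_split(real, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(random_state=0)
    clf.fit(synthetic.drop(columns=[target]), synthetic[target])
    preds = clf.predict(real_test.drop(columns=[target]))
    return accuracy_score(real_test[target], preds)
```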
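
The reported optimization settings map onto Hugging Face `TrainingArguments`, which the authors' framework builds on; the argument names below are the standard `transformers` ones, and the concrete values (Adult Income epoch count, batch size 32) are illustrative choices within the ranges the paper reports:

```python
from transformers import TrainingArguments

# Sketch of the reported fine-tuning settings in Hugging Face terms
# (argument names are standard transformers ones, not quoted from the paper).
args = TrainingArguments(
    output_dir="great-adult",         # hypothetical output directory
    num_train_epochs=310,             # GReaT on Adult Income, per the paper
    per_device_train_batch_size=32,   # the paper varies batch size from 8 to 124
    learning_rate=5e-5,               # AdamW (the transformers default) with lr 5e-5
)
# Sampling then uses temperature T = 0.7 for all experiments and data sets.
```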