Language Models are Realistic Tabular Data Generators
Authors: Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. |
| Researcher Affiliation | Academia | 1 University of Tübingen, Tübingen, Germany; 2 Technical University of Munich, Munich, Germany |
| Pseudocode | No | The paper includes figures illustrating the data pipeline and sampling procedure (Fig. 2 and Fig. 3) but does not contain any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Finally, we provide an easy-to-use Python implementation of the GReaT model, where it takes only three lines of code to generate new synthetic samples. Access to the package is provided via pip install be-great (https://github.com/kathrinse/be_great). A minimal usage sketch appears after this table. |
| Open Datasets | Yes | Key characteristics of each data set are presented in Table 7. We split all data sets into 80% train and 20% test sets to avoid any data leakage. All models are trained or fine-tuned on the same training data samples. Adult Income https://archive.ics.uci.edu/ml/datasets/Adult/ (from Table 9). |
| Dataset Splits | No | We split all data sets into 80% train and 20% test sets to avoid any data leakage. |
| Hardware Specification | Yes | Our hardware setup consisted of two NVIDIA 2080RTX GPUs with 12 GB RAM each, 126 GB system RAM, and an AMD Ryzen 3960X with 24 cores; we use the Ubuntu 20.04 operating system. |
| Software Dependencies | No | We utilize pretrained generative language models from the established Hugging Face framework (Wolf et al., 2020). For the ML efficiency and discriminator experiments... we additionally use linear/logistic regression, decision tree, and random forest models from the Scikit-Learn package (Buitinck et al., 2013). (No specific version numbers are provided for these software components.) An evaluation sketch appears after this table. |
| Experiment Setup | Yes | We fine-tune the Distill-GReaT model on each data set for 200 epochs, except for the California Housing and Diabetes data sets, which we fine-tune for 100 epochs. The GReaT baseline is fine-tuned for 110, 310, 400, 255, 150, and 85 epochs for the California Housing, Adult Income, Travel, Home Equity Line of Credit (HELOC), Sick, and Diabetes data sets, respectively. Depending on GPU memory limitations, we vary the batch size from 8 to 124. For the sampling step, we set the temperature parameter T to 0.7 for all experiments and data sets. We utilize the AdamW optimizer... with a learning rate of 5e-5. A configuration sketch appears after this table. |
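
A minimal sketch of the advertised three-line workflow, assuming the `be_great` API as documented in the linked repository (the constructor arguments `llm`, `batch_size`, and `epochs` and the `fit`/`sample` methods should be checked against the installed package version); the California Housing data set stands in here for any tabular DataFrame:

```python
# Sketch of the be_great workflow; argument names may differ across package versions.
from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame  # a public tabular data set

model = GReaT(llm="distilgpt2", batch_size=32, epochs=50)  # Distill-GReaT-style backbone
model.fit(data)                                            # fine-tune on the real table
synthetic_data = model.sample(n_samples=100)               # generate synthetic rows
```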
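
The ML efficiency protocol mentioned under Software Dependencies (train a downstream model on synthetic rows, evaluate on the held-out real test split) can be sketched with the scikit-learn estimators the paper names. The function below is illustrative rather than the authors' code; `real`, `synthetic`, and `target` are hypothetical placeholders:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def ml_efficiency(real: pd.DataFrame, synthetic: pd.DataFrame, target: str) -> float:
    """Train on synthetic data, score on the 20% real test split (illustrative only)."""
    # 80/20 split of the real data, mirroring the paper's setup.
    _, real_test = train_test_split(real, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(random_state=0)
    clf.fit(synthetic.drop(columns=[target]), synthetic[target])
    preds = clf.predict(real_test.drop(columns=[target]))
    return accuracy_score(real_test[target], preds)
```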
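
The reported optimization settings map onto Hugging Face `TrainingArguments`, which the authors' framework builds on; the argument names below are the standard `transformers` ones, and the concrete values (Adult Income epoch count, batch size 32) are illustrative choices within the ranges the paper reports:

```python
from transformers import TrainingArguments

# Sketch of the reported fine-tuning settings in Hugging Face terms
# (argument names are standard transformers ones, not quoted from the paper).
args = TrainingArguments(
    output_dir="great-adult",         # hypothetical output directory
    num_train_epochs=310,             # GReaT on Adult Income, per the paper
    per_device_train_batch_size=32,   # the paper varies batch size from 8 to 124
    learning_rate=5e-5,               # AdamW (the transformers default) with lr 5e-5
)
# Sampling then uses temperature T = 0.7 for all experiments and data sets.
```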