Learning Vine Copula Models for Synthetic Data Generation

Authors: Yi Sun, Alfredo Cuesta-Infante, Kalyan Veeramachaneni
Pages: 5049-5057

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Throughout experiments on synthetic and real-world datasets, we show that our proposed approach fits the data better in terms of log-likelihood. Moreover, we demonstrate that the model is able to generate high-quality samples in a variety of applications, making it a good candidate for synthetic data generation.
Researcher Affiliation | Academia | Yi Sun (1), Alfredo Cuesta-Infante (2), Kalyan Veeramachaneni (1); (1) MIT, (2) Universidad Rey Juan Carlos; yis@mit.edu, alfredo.cuesta@urjc.es, kalyan@csail.mit.edu
Pseudocode | Yes | Figure 4: Algorithm for learning vine structure; Figure 7: Algorithm for sampling from the learned vine
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described.
Open Datasets | Yes | The three data sets used in experiments are: Wisconsin Breast Cancer: Describes 30 variables computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and a binary variable indicating if the mass is benign or malignant. This dataset includes 569 instances. Wine Quality: This dataset includes 11 physicochemical variables and a quality score between 0 and 10 for red (1599 instances) and white (4898 instances) variants of the Portuguese Vinho Verde wine. Crime: The communities and crime dataset includes 100 variables related to crimes ranging from socio-economic data to law enforcement data and an attribute to be predicted (Per Capita Violent Crime). This dataset has 1994 instances.
Dataset Splits | No | The paper mentions '10-fold cross-validation' for evaluation but does not specify a distinct training/validation/test split with percentages or counts for a separate validation set.
Hardware Specification | No | The paper mentions experiments were run 'with a single GPU' but does not specify the GPU model, CPU, memory, or other detailed hardware specifications.
Software Dependencies | No | The paper mentions various machine learning techniques and neural network components (e.g., LSTM, ReLU) but does not provide specific version numbers for any software libraries or dependencies used.
Experiment Setup | Yes | The batch size used is 64 for breast cancer dataset and 128 for the other two datasets. The neural network used for creating vines for the vector representation is set-up as fully connected feed forward neural networks with two hidden layers. Each layer uses ReLU as activation function and the output layer is normalized by a softmax layer. The network for the reinforcement learning representation is set up as LSTM. The algorithm is trained over 50 epochs in the experiments. For ease of computation, the learned vines are truncated after the third level, which means all pair copulas are assumed to be independent beyond the third level. All results reported are based on 10-fold cross-validation over different splits of training and testing set.
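
Two settings in the Experiment Setup row, truncation after the third level and 10-fold cross-validation, can be made concrete with a short sketch. This is not the authors' code, which the page says was not released. The function names `pair_copula_count` and `k_fold_indices`, the shuffling seed, and the example dimension of 12 variables (the 11 wine variables plus the quality score, assuming all are modeled jointly) are illustrative assumptions. The sketch relies only on the standard fact that tree k of a regular vine on d variables has d - k edges, one pair copula per edge.

```python
import random


def pair_copula_count(num_variables, truncation_level=None):
    """Count the pair copulas fitted in a regular vine (hypothetical helper).

    Tree k (1-indexed) of a regular vine on d variables has d - k edges, and
    each edge carries one pair copula. With truncation, only the first
    `truncation_level` trees are fitted; later pairs are treated as independent.
    """
    total_trees = num_variables - 1
    if truncation_level is None:
        kept_trees = total_trees
    else:
        kept_trees = min(truncation_level, total_trees)
    return sum(num_variables - k for k in range(1, kept_trees + 1))


def k_fold_indices(num_rows, num_folds=10, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation.

    The seed and the round-robin fold assignment are assumptions made for
    this sketch; the paper does not describe how its folds were drawn.
    """
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Assumed dimension: 12 jointly modeled variables.
print(pair_copula_count(12))     # full vine: 66 pair copulas
print(pair_copula_count(12, 3))  # truncated after level 3: 30 pair copulas

# 569 is the Breast Cancer instance count quoted in the table.
for train, test in k_fold_indices(569, num_folds=10):
    assert len(train) + len(test) == 569
```

For d variables, truncation after level 3 reduces the number of fitted pair copulas from d(d-1)/2 to 3d - 6 (for d of at least 4), which is where the computational saving mentioned in the row comes from.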