Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation
Authors: Kai Xu, Georgi Ganev, Emile Joubert, Rees Davison, Olivier Van Acker, Luke Robinson
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a real-world dataset, we demonstrate that our method can generate synthetic datasets while preserving information within and across tables better than its closest competitor. |
| Researcher Affiliation | Collaboration | Kai Xu Hazy me@xuk.ai; Georgi Ganev Hazy georgi@hazy.com; Emile Joubert Hazy emile@hazy.com; Rees Davison Hazy rees@hazy.com; Olivier Van Acker Hazy ovanac01@mail.bbk.ac.uk; Luke Robinson Hazy luke@hazy.com. Now at Amazon; work done prior to joining Amazon. Also a Ph.D. candidate at University College London. |
| Pseudocode | Yes | Algorithm 1 Bipartite 2K-generator (Boroojeni et al., 2017, corrected) |
| Open Source Code | Yes | Our implementations of Bayes M2M and Neural M2M are available at github.com/hazy/m2m. |
| Open Datasets | Yes | Dataset: We consider MOVIELENS in our evaluation. MOVIELENS is a dataset that contains users' ratings of different movies (Harper & Konstan, 2015). |
| Dataset Splits | Yes | Data is split 80/20 into training and validation sets. |
| Hardware Specification | Yes | All experiments are conducted on an Amazon EC2 instance of type c5.4xlarge with CPUs only. |
| Software Dependencies | No | The paper mentions software like Flux.jl, ParameterSchedulers.jl, and Node2Vec.jl by name and provides links to their repositories, but it does not specify concrete version numbers for these software components, which is required for reproducibility. |
| Experiment Setup | Yes | All training is done using the ADAM optimizer (Kingma & Ba, 2014) for at most 1,000 steps with a learning rate of 1 × 10^-3 and a batch size of 500. We also use a learning rate scheduler based on the sine function with exponential amplitude decay, with a minimum learning rate of 2 × 10^-4 and a 5-epoch periodicity. Finally, early stopping is used based on the loss computed on the validation set: if the loss does not drop for 5 epochs, training is stopped early. node2vec: the number of walks is 10, the walk length is 100, p = q = 2, and the embedding size is 20. (Illustrative sketches of this setup follow the table.) |
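As a reading aid, below is a minimal Python sketch of the training schedule described in the Experiment Setup row: a sine-shaped learning rate with exponentially decaying amplitude, plus patience-based early stopping on the validation loss. The paper's actual implementation is in Julia (Flux.jl and ParameterSchedulers.jl); the exact functional form of the schedule, the `decay` value, and the function names here are assumptions made only for illustration.

```python
import math

# Hyperparameters quoted in the paper's setup.
LR_MAX = 1e-3        # starting / peak learning rate
LR_MIN = 2e-4        # minimum learning rate
PERIOD = 5           # sine periodicity, in epochs
PATIENCE = 5         # early-stopping patience, in epochs

def sine_exp_lr(epoch: int, decay: float = 0.9) -> float:
    """Sine-shaped learning rate whose amplitude decays exponentially.

    The paper does not spell out the exact functional form, so this is one
    plausible reading: the rate starts at LR_MAX, oscillates with a 5-epoch
    period, and its peak shrinks geometrically (by `decay`) every period,
    never dropping below LR_MIN.
    """
    amplitude = (LR_MAX - LR_MIN) * decay ** (epoch // PERIOD)
    phase = abs(math.cos(math.pi * epoch / PERIOD))
    return LR_MIN + amplitude * phase

def should_stop(val_losses: list[float], patience: int = PATIENCE) -> bool:
    """Early stopping: stop once the validation loss has not improved
    for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Example: print the schedule over the first two periods.
for epoch in range(2 * PERIOD):
    print(epoch, round(sine_exp_lr(epoch), 6))
```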
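The node2vec hyperparameters listed above (10 walks, walk length 100, p = q = 2, embedding size 20) map directly onto common node2vec implementations. The paper uses Node2Vec.jl; purely as an illustrative assumption, the same settings could be expressed with the Python `node2vec` package on a toy NetworkX graph as follows. The toy graph, node names, and the `window`/`min_count` fitting arguments are not from the paper.

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# Toy user/movie graph standing in for the MOVIELENS rating graph.
graph = nx.Graph()
graph.add_edges_from([("u1", "m1"), ("u1", "m2"), ("u2", "m2"), ("u3", "m1")])

# Hyperparameters quoted in the paper's setup.
node2vec = Node2Vec(
    graph,
    dimensions=20,    # embedding size
    walk_length=100,  # walk length
    num_walks=10,     # number of walks per node
    p=2,              # return parameter
    q=2,              # in-out parameter
)
model = node2vec.fit(window=10, min_count=1)  # fitting args are assumptions
embedding_u1 = model.wv["u1"]  # 20-dimensional embedding of node "u1"
```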