Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation
Authors: Kai Xu, Georgi Ganev, Emile Joubert, Rees Davison, Olivier Van Acker, Luke Robinson
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a real-world dataset, we demonstrate that our method can generate synthetic datasets while preserving information within and across tables better than its closest competitor. |
| Researcher Affiliation | Collaboration | Kai Xu Hazy me@xuk.ai; Georgi Ganev Hazy georgi@hazy.com; Emile Joubert Hazy emile@hazy.com; Rees Davison Hazy rees@hazy.com; Olivier Van Acker Hazy ovanac01@mail.bbk.ac.uk; Luke Robinson Hazy luke@hazy.com. Now at Amazon; work done prior to joining Amazon. Also a Ph.D. candidate at University College London. |
| Pseudocode | Yes | Algorithm 1 Bipartite 2K-generator (Boroojeni et al., 2017, corrected) |
| Open Source Code | Yes | Our implementations of Bayes M2M and Neural M2M are available at github.com/hazy/m2m. |
| Open Datasets | Yes | Dataset: We consider MOVIELENS in our evaluation. MOVIELENS is a dataset that contains users' ratings of different movies (Harper & Konstan, 2015). |
| Dataset Splits | Yes | Data is split 80/20 into training and validation sets. |
| Hardware Specification | Yes | All experiments are conducted on an Amazon EC2 instance of type c5.4xlarge with CPUs only. |
| Software Dependencies | No | The paper mentions software like Flux.jl, ParameterSchedulers.jl, and Node2Vec.jl by name and provides links to their repositories, but it does not specify concrete version numbers for these software components, which is required for reproducibility. |
| Experiment Setup | Yes | All training is done using the ADAM optimizer (Kingma & Ba, 2014) for at most 1,000 steps with a learning rate of 1 × 10^-3 and a batch size of 500. We also use a learning rate scheduler based on the sine function with exponential amplitude decay, with a minimum learning rate of 2 × 10^-4 and a 5-epoch periodicity. Finally, early stopping is used based on the loss computed on the validation set: if the loss does not drop for 5 epochs, training is stopped early. node2vec: the number of walks is 10, the walk length is 100, p = q = 2, and the embedding size is 20. (Illustrative sketches of this setup follow the table.) |
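As a reading aid, below is a minimal Python sketch of the training schedule described in the Experiment Setup row: a sine-shaped learning rate with exponentially decaying amplitude, plus patience-based early stopping on the validation loss. The paper's actual implementation is in Julia (Flux.jl and ParameterSchedulers.jl); the exact functional form of the schedule, the `decay` value, and the function names here are assumptions made only for illustration.

```python
import math

# Hyperparameters quoted in the paper's setup.
LR_MAX = 1e-3        # starting / peak learning rate
LR_MIN = 2e-4        # minimum learning rate
PERIOD = 5           # sine periodicity, in epochs
PATIENCE = 5         # early-stopping patience, in epochs

def sine_exp_lr(epoch: int, decay: float = 0.9) -> float:
    """Sine-shaped learning rate whose amplitude decays exponentially.

    The paper does not spell out the exact functional form, so this is one
    plausible reading: the rate starts at LR_MAX, oscillates with a 5-epoch
    period, and its peak shrinks geometrically (by `decay`) every period,
    never dropping below LR_MIN.
    """
    amplitude = (LR_MAX - LR_MIN) * decay ** (epoch // PERIOD)
    phase = abs(math.cos(math.pi * epoch / PERIOD))
    return LR_MIN + amplitude * phase

def should_stop(val_losses: list[float], patience: int = PATIENCE) -> bool:
    """Early stopping: stop once the validation loss has not improved
    for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Example: print the schedule over the first two periods.
for epoch in range(2 * PERIOD):
    print(epoch, round(sine_exp_lr(epoch), 6))
```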
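The node2vec hyperparameters listed above (10 walks, walk length 100, p = q = 2, embedding size 20) map directly onto common node2vec implementations. The paper uses Node2Vec.jl; purely as an illustrative assumption, the same settings could be expressed with the Python `node2vec` package on a toy NetworkX graph as follows. The toy graph, node names, and the `window`/`min_count` fitting arguments are not from the paper.

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# Toy user/movie graph standing in for the MOVIELENS rating graph.
graph = nx.Graph()
graph.add_edges_from([("u1", "m1"), ("u1", "m2"), ("u2", "m2"), ("u3", "m1")])

# Hyperparameters quoted in the paper's setup.
node2vec = Node2Vec(
    graph,
    dimensions=20,    # embedding size
    walk_length=100,  # walk length
    num_walks=10,     # number of walks per node
    p=2,              # return parameter
    q=2,              # in-out parameter
)
model = node2vec.fit(window=10, min_count=1)  # fitting args are assumptions
embedding_u1 = model.wv["u1"]  # 20-dimensional embedding of node "u1"
```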