Graphically Structured Diffusion Models

Authors: Christian Dietrich Weilbach, William Harvey, Frank Wood

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across a diverse set of experiments we improve the scaling relationship between problem dimension and our model's performance, in terms of both training time and final accuracy. Our code can be found at https://github.com/plai-group/gsdm. From Section 4 (Experiments): Our experiments compare GSDM against ablations including a non-sparse version (i.e. a vanilla DM), as well as the variational auto-encoder for arbitrary conditioning (VAEAC) (Ivanov et al., 2019) and, where appropriate, the best performing MCMC method we tried: Lightweight Metropolis Hastings (LMH) (Wingate et al., 2011).
Researcher Affiliation | Academia | Department of Computer Science, University of British Columbia, Vancouver, Canada. Correspondence to: Christian Weilbach <weilbach@cs.ubc.ca>.
Pseudocode | Yes | Figure 19: Source code of a full generative model for the BCMF experiment. Passing this into our compiler yields the attention mask in Figure 18. Note that intermediate variables for C are explicitly created by sampling from a Dirac distribution. (An illustrative sketch of this pattern follows the table.)
Open Source Code | Yes | Our code can be found at https://github.com/plai-group/gsdm.
Open Datasets | No | All data is sampled synthetically on-the-fly, so data points used in one minibatch are never repeated in another minibatch. The paper describes generating synthetic data for its experiments (e.g., "Our data generator creates complete 9x9 Sudokus"), rather than using or providing access information for a pre-existing, publicly available dataset. (See the training-loop sketch after the table.)
Dataset Splits | No | Accuracy is computed on 16 validation examples every 500 iterations, to a maximum of 20,000. We compute validation metrics regularly throughout training, in particular every 1000 iterations for BCMF and HBCMF and every 5000 iterations for sorting and Sudoku. The paper mentions using "validation examples" and computing "validation metrics," but since data is generated on-the-fly, it does not specify fixed train/validation/test splits with percentages or counts for a static dataset.
Hardware Specification | Yes | We use NVIDIA A100 GPUs for sorting and BCMF, and smaller NVIDIA RTX A5000s for all ablations and other problems.
Software Dependencies | No | We use the Adam optimizer with β1 = 0.9 and β2 = 0.999 (Kingma & Ba, 2015), no weight decay and gradient clipping at 1.0. We then simply round each element in R to be in {0, 1}. The paper mentions specific optimizers and refers to tools like Scikit-learn, but does not provide specific version numbers for software components (e.g., Python, PyTorch, Scikit-learn) which are necessary for full reproducibility. (See the optimizer sketch after the table.)
Experiment Setup | Yes | Table 2: Experimental parameters. Listed training times refer to those used in Figures 3 and 5. The numbers of training iterations refer to those listed in the same plot with the listed problem dimension. They vary with problem dimensions as we trained all dimensions for a fixed training time on each problem, and the time per iteration depends on problem dimension. The training curves in Figure 6 were obtained by training for longer in some cases. The dimensions listed are those used in Figure 6; dimensions are varied and clearly stated in other results.

Parameter | Sorting | Sudoku | BCMF | Boolean
Problem dimension | n = 20 | 9×9 | n, m, k = 16, 10, 8 |
Training time | 1 day | 1 day | 8 hours | 40-160 min.
Training iters (1000s) | 120 | 320 | 20 |
Batch size | 16 | 32 | 8 | 16
Learning rate | 2×10⁻⁴ | 2×10⁻⁵ | 2×10⁻⁵ | 2×10⁻⁵
Embedding dim. | 64 | 128 | 64 | 64
# transformer layers | 6 | 6 | 12 | 12
# attention heads | 8 | 8 | 2 | 2
GPU type | A100 | A5000 | A100 | A5000
VAEAC learning rate | 3×10⁻⁵ | 3×10⁻⁴ | 3×10⁻⁵ | 3×10⁻⁵
LMH warmup samples | 5000 | 5000 | - |

We use 1000 diffusion timesteps in all experiments and set the hyperparameters β1, ..., β1000 using a linear interpolation schedule (Ho et al., 2020) from β1 = 10⁻⁴ to β1000 = 0.005. Finally, we use the Adam optimizer with β1 = 0.9 and β2 = 0.999 (Kingma & Ba, 2015), no weight decay and gradient clipping at 1.0. (Sketches of the noise schedule and optimizer configuration follow below.)
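The Pseudocode row refers to the paper's Figure 19, which gives the source code of a BCMF generative model whose deterministic intermediates are written as Dirac samples so that the compiler can register them as nodes in the graphical model. The sketch below illustrates that pattern only: the distributions, the variable names W, H, C, V, and the dirac helper are assumptions rather than the authors' code, with shapes taken from Table 2 (n, m, k = 16, 10, 8).

    # Minimal sketch of a BCMF-style generative model (not the paper's Figure 19 code).
    import numpy as np

    rng = np.random.default_rng(0)

    def dirac(value):
        # "Sampling" from a Dirac distribution just returns the deterministic value.
        # Wrapping intermediates this way makes them explicit random variables,
        # so a compiler can expose them when building the attention mask.
        return value

    def bcmf_generative_model(n=16, m=10, k=8):
        W = rng.normal(size=(n, k))            # continuous factor (assumed Gaussian)
        H = rng.binomial(1, 0.5, size=(k, m))  # binary factor (assumed Bernoulli)
        C = dirac(W @ H)                       # deterministic intermediate, made explicit
        V = rng.normal(loc=C, scale=0.1)       # noisy observation of the product (assumed)
        return W, H, C, V

    W, H, C, V = bcmf_generative_model()
    print(V.shape)  # (16, 10)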
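The Open Datasets and Dataset Splits rows quote on-the-fly synthetic data generation and periodic validation (e.g. 16 validation examples every 500 iterations, to a maximum of 20,000). The loop below is a minimal sketch of that pattern assuming a PyTorch-style setup; sample_fn, the linear "model", and the dummy objective are hypothetical placeholders, not interfaces from the gsdm repository.

    import torch
    from torch.utils.data import IterableDataset, DataLoader

    class SyntheticStream(IterableDataset):
        """Yields freshly sampled problems, so no minibatch repeats another's data points."""
        def __init__(self, sample_fn):
            self.sample_fn = sample_fn

        def __iter__(self):
            while True:
                yield self.sample_fn()

    def sample_fn():
        # Hypothetical stand-in for the paper's problem generators (e.g. complete 9x9 Sudokus).
        return torch.randn(81)

    model = torch.nn.Linear(81, 81)  # stand-in for the diffusion model
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    loader = DataLoader(SyntheticStream(sample_fn), batch_size=16)

    for it, batch in enumerate(loader, start=1):
        loss = model(batch).pow(2).mean()  # dummy objective for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if it % 500 == 0:                  # validation metrics computed periodically
            val_batch = torch.stack([sample_fn() for _ in range(16)])
            with torch.no_grad():
                print(it, model(val_batch).pow(2).mean().item())
        if it >= 20_000:                   # maximum number of iterations in the quote
            break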
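The Software Dependencies row notes that versions are not pinned, but the optimizer settings themselves are fully specified. Assuming PyTorch (an assumption, not something the report confirms), the stated configuration of Adam with β1 = 0.9, β2 = 0.999, no weight decay, and gradient clipping at 1.0 looks like the following; the linear layer and loss are placeholders, and the learning rate shown is the sorting value from Table 2.

    import torch

    model = torch.nn.Linear(8, 8)  # placeholder for the diffusion network
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4,
                                 betas=(0.9, 0.999), weight_decay=0.0)

    loss = model(torch.randn(4, 8)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
    optimizer.step()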
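The Experiment Setup row also states the diffusion noise schedule: 1000 timesteps with betas linearly interpolated from 10⁻⁴ to 0.005 (Ho et al., 2020). A minimal sketch of that schedule is below; the cumulative alpha-bar product is standard DDPM bookkeeping rather than anything specific to GSDM.

    import numpy as np

    T = 1000
    betas = np.linspace(1e-4, 0.005, T)  # beta_1, ..., beta_1000, linearly interpolated
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)

    print(betas[0], betas[-1])           # 0.0001 0.005
    print(alpha_bars[-1])                # remaining signal scale at t = T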