Maximum Likelihood Training of Score-Based Diffusion Models

Authors: Yang Song, Conor Durkan, Iain Murray, Stefano Ermon

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically observe that maximum likelihood training consistently improves the likelihood of score-based diffusion models across multiple datasets, stochastic processes, and model architectures. We empirically test the performance of likelihood weighting, importance sampling and variational dequantization across multiple architectures of score-based models, SDEs, and datasets. In particular, we consider DDPM++ (Baseline in Table 2) and DDPM++ (deep) (Deep in Table 2) models with VP and subVP SDEs [48] on CIFAR-10 [28] and ImageNet 32×32 [55] datasets. We omit experiments on the VE SDE since (i) under this SDE our likelihood weighting is the same as the original weighting in [48]; (ii) we empirically observe that the best VE SDE model achieves around 3.4 bits/dim on CIFAR-10 in our experiments, which is significantly worse than other SDEs. For each experiment, we report E[−log p_θ^ODE(x)] (Negative log-likelihood in Table 2), and the upper bound E[L_θ^DSM(x)] (Bound in Table 2). In addition, we report FID scores [17] for samples from p_θ^ODE, produced by solving the corresponding ODE with the Dormand-Prince RK45 [14] solver. (A sketch of this ODE-based sampling appears after the table.)
Researcher Affiliation | Academia | Yang Song, Computer Science Department, Stanford University (yangsong@cs.stanford.edu); Conor Durkan, School of Informatics, University of Edinburgh (conor.durkan@ed.ac.uk); Iain Murray, School of Informatics, University of Edinburgh (i.murray@ed.ac.uk); Stefano Ermon, Computer Science Department, Stanford University (ermon@cs.stanford.edu)
Pseudocode | No | The paper describes methods and processes through text and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is released at https://github.com/yang-song/score_flow.
Open Datasets | Yes | We empirically test the performance of likelihood weighting, importance sampling and variational dequantization across multiple architectures of score-based models, SDEs, and datasets. In particular, we consider DDPM++ (Baseline in Table 2) and DDPM++ (deep) (Deep in Table 2) models with VP and subVP SDEs [48] on CIFAR-10 [28] and ImageNet 32×32 [55] datasets.
Dataset Splits | No | The paper mentions using CIFAR-10 and ImageNet 32×32 datasets and training steps/batch sizes, but it does not explicitly provide information on how the data was split into training, validation, and test sets (e.g., percentages or specific sample counts).
Hardware Specification | Yes | All models were trained on Google Cloud TPUs (v3-8 and v3-32).
Software Dependencies | No | The paper describes training details, but does not explicitly list specific software dependencies with version numbers (e.g., 'PyTorch 1.x', 'Python 3.x', or 'CUDA 11.x').
Experiment Setup | Yes | For CIFAR-10 experiments, we train our models for 250,000 optimization steps, with a batch size of 256. For ImageNet 32×32 experiments, we train our models for 500,000 optimization steps, with a batch size of 256. For both datasets, we use the Adam optimizer [29] with β1 = 0.9, β2 = 0.999, and ϵ = 1e−8. The learning rate starts at 2e−4 and decays linearly to 1e−5 after 100,000 optimization steps.
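
The Experiment Setup row quotes the optimizer and learning-rate schedule used for training. Below is a minimal sketch of that configuration, assuming JAX/Optax (the released score_flow repository is JAX-based); the schedule is one reading of the quoted description, not the authors' exact training script.

```python
# Hedged sketch of the quoted optimizer settings using Optax; the choice of
# optax.adam plus a linear schedule is an assumption, not taken from the paper.
import optax

# Learning rate: starts at 2e-4 and decays linearly to 1e-5 over the first
# 100,000 optimization steps, then stays at 1e-5.
lr_schedule = optax.linear_schedule(
    init_value=2e-4, end_value=1e-5, transition_steps=100_000)

# Adam with beta1 = 0.9, beta2 = 0.999, eps = 1e-8, as quoted in the paper.
optimizer = optax.adam(learning_rate=lr_schedule, b1=0.9, b2=0.999, eps=1e-8)

# Training would then run for 250,000 steps (CIFAR-10) or 500,000 steps
# (ImageNet 32x32) with batch size 256.
```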
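
The Research Type row notes that FID scores are computed on samples from p_θ^ODE, obtained by solving the corresponding ODE with the Dormand-Prince RK45 solver. The sketch below illustrates that idea for a VP SDE using SciPy's general-purpose RK45 integrator; the score model score_fn(x, t) is a hypothetical placeholder, and this is not the authors' released sampler.

```python
# Minimal sketch: sampling from p_theta^ODE by integrating the probability
# flow ODE of a VP SDE backwards in time with an RK45 solver.
import numpy as np
from scipy import integrate

beta_min, beta_max = 0.1, 20.0          # standard VP SDE noise schedule (assumption)
shape = (16, 32, 32, 3)                 # a small batch of 32x32 RGB images

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def probability_flow_ode(t, x_flat, score_fn):
    """dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t), with f(x, t) = -0.5 * beta(t) * x
    and g(t)^2 = beta(t) for the VP SDE."""
    x = x_flat.reshape(shape)
    drift = -0.5 * beta(t) * x - 0.5 * beta(t) * score_fn(x, t)
    return drift.reshape(-1)

def sample(score_fn, eps=1e-3):
    # Draw from the VP SDE's prior (approximately standard normal) at t = 1
    # and integrate backwards to t = eps with the Dormand-Prince RK45 solver.
    x_T = np.random.randn(*shape)
    sol = integrate.solve_ivp(
        probability_flow_ode, (1.0, eps), x_T.reshape(-1),
        args=(score_fn,), method="RK45", rtol=1e-5, atol=1e-5)
    return sol.y[:, -1].reshape(shape)
```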