Generative Modeling by Estimating Gradients of the Data Distribution

Authors: Yang Song, Stefano Ermon

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments. Experimentally, we demonstrate the efficacy of our approach on MNIST, CelebA [34], and CIFAR-10 [31]. We show that the samples look comparable to those generated from modern likelihood-based models and GANs. On CIFAR-10, our model sets the new state-of-the-art inception score of 8.87 for unconditional generative models, and achieves a competitive FID score of 25.32. We show that the model learns meaningful representations of the data by image inpainting experiments. For quantitative evaluation, we report inception [48] and FID [20] scores on CIFAR-10 in Tab. 1.
Researcher Affiliation | Academia | Yang Song, Stanford University (yangsong@cs.stanford.edu); Stefano Ermon, Stanford University (ermon@cs.stanford.edu)
Pseudocode | Yes | Algorithm 1 (annealed Langevin dynamics), reproduced below; a runnable Python sketch appears after the table.
    Algorithm 1 Annealed Langevin dynamics.
    Require: {σ_i}_{i=1}^L, ε, T
    1: Initialize x̃_0
    2: for i ← 1 to L do
    3:     α_i ← ε · σ_i² / σ_L²    ▷ α_i is the step size.
    4:     for t ← 1 to T do
    5:         Draw z_t ∼ N(0, I)
    6:         x̃_t ← x̃_{t−1} + (α_i / 2) s_θ(x̃_{t−1}, σ_i) + √α_i z_t
    7:     end for
    8:     x̃_0 ← x̃_T
    9: end for
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology.
Open Datasets | Yes | We use MNIST, CelebA [34], and CIFAR-10 [31] datasets in our experiments. For CelebA, the images are first center-cropped to 140 × 140 and then resized to 32 × 32. All images are rescaled so that pixel values are in [0, 1]. (A sketch of this preprocessing appears after the table.)
Dataset Splits | No | The paper mentions using MNIST, CelebA, and CIFAR-10 datasets but does not explicitly provide details about the training, validation, or test data splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | We choose L = 10 different standard deviations such that {σ_i}_{i=1}^L is a geometric sequence with σ_1 = 1 and σ_10 = 0.01. When using annealed Langevin dynamics for image generation, we choose T = 100 and ε = 2 × 10⁻⁵, and use uniform noise as our initial samples. We found the results are robust w.r.t. the choice of T, and ε between 5 × 10⁻⁶ and 5 × 10⁻⁵ generally works fine. In the experiments, our model s_θ(x, σ) combines the architecture design of U-Net [46] with dilated/atrous convolution [64, 65, 8], both of which have been proved very successful in semantic segmentation. In addition, we adopt instance normalization in our score network, inspired by its superior performance in some image generation tasks [57, 13, 23], and we use a modified version of conditional instance normalization [13] to provide conditioning on σ_i. (The noise schedule is reconstructed in a snippet after the table.)
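The Algorithm 1 pseudocode quoted in the Pseudocode row maps directly onto a short sampler. Below is a minimal PyTorch sketch under stated assumptions: the score network is called as score_net(x, i) with a noise-level index i (the paper conditions on σ_i, but its exact interface is not given), and the batch shape is illustrative only.

```python
import torch

def annealed_langevin_dynamics(score_net, sigmas, eps=2e-5, T=100,
                               shape=(64, 3, 32, 32), device="cpu"):
    """Minimal sketch of Algorithm 1 (annealed Langevin dynamics).

    Assumes score_net(x, i) returns the score estimate s_theta(x, sigma_i)
    for the i-th noise level; sigmas is a decreasing sequence of noise levels.
    """
    x = torch.rand(shape, device=device)  # the paper initializes from uniform noise
    for i, sigma in enumerate(sigmas):
        alpha = eps * sigma**2 / sigmas[-1]**2  # step size annealed with sigma_i
        for _ in range(T):
            z = torch.randn_like(x)
            with torch.no_grad():
                score = score_net(x, i)
            # Langevin update: x <- x + (alpha / 2) * score + sqrt(alpha) * z
            x = x + (alpha / 2) * score + alpha**0.5 * z
    return x
```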
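The CelebA preprocessing quoted in the Open Datasets row (center-crop to 140 × 140, resize to 32 × 32, rescale to [0, 1]) can be expressed with standard torchvision transforms. This is a sketch assuming PIL-image inputs, not the authors' released pipeline.

```python
from torchvision import transforms

# Sketch of the quoted CelebA preprocessing: center-crop to 140x140,
# resize to 32x32, and rescale pixel values to [0, 1].
celeba_transform = transforms.Compose([
    transforms.CenterCrop(140),
    transforms.Resize(32),
    transforms.ToTensor(),  # converts a PIL image to a float tensor in [0, 1]
])
```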
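The noise schedule in the Experiment Setup row is fully specified and easy to reconstruct; numpy.geomspace is one convenient way to build the geometric sequence (a convenience choice, not necessarily what the authors used).

```python
import numpy as np

# Geometric sequence of L = 10 noise levels, from sigma_1 = 1.0 down to sigma_10 = 0.01.
L = 10
sigmas = np.geomspace(1.0, 0.01, num=L)

# Annealed Langevin dynamics hyperparameters quoted above.
T = 100      # Langevin steps per noise level
eps = 2e-5   # base step size; the paper reports robustness for eps in [5e-6, 5e-5]
```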