Learning Energy-Based Prior Model with Diffusion-Amortized MCMC

Authors: Peiyu Yu, Yaxuan Zhu, Sirui Xie, Xiaojian (Shawn) Ma, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on several image modeling benchmark datasets demonstrate the superior performance of our method compared with strong counterparts. [Footnote 1: Code and data available at https://github.com/yuPeiyu98/Diffusion-Amortized-MCMC.] ... 4 Experiments: In this section, we are interested in the following questions: (i) How does the proposed method compare with its previous counterparts (e.g., purely MCMC-based or variational methods)? (ii) How is the scalability of this method? (iii) How are the time and parameter efficiencies? (iv) Does the proposed method provide a desirable latent space? To answer these questions, we present a series of experiments on benchmark datasets including MNIST [42], SVHN [43], CelebA-64 [44], CIFAR-10 [45], CelebAMask-HQ [46], FFHQ [10] and LSUN-Tower [47]. As to be shown, the proposed method demonstrates consistently better performance in various experimental settings compared with previous methods."
Researcher Affiliation | Collaboration | Peiyu Yu (1, yupeiyu98@g.ucla.edu); Yaxuan Zhu (1, yaxuanzhu@g.ucla.edu); Sirui Xie (2, srxie@ucla.edu); Xiaojian Ma (2, 4, xiaojian.ma@ucla.edu); Ruiqi Gao (3, ruiqig@google.com); Song-Chun Zhu (4, sczhu@stat.ucla.edu); Ying Nian Wu (1, ywu@stat.ucla.edu). Affiliations: 1 UCLA Department of Statistics; 2 UCLA Department of Computer Science; 3 Google DeepMind; 4 Beijing Institute for General Artificial Intelligence (BIGAI).
Pseudocode | Yes | "C. PyTorch-Style Pseudocode: We provide PyTorch-style pseudocode to help understand the proposed method. We denote the generator network as G, the energy score network as E and the diffusion network as Q. The first page sketches the prior and posterior sampling process. The second page outlines the learning procedure. Listing 1: Prior and posterior LD sampling. Listing 2: Learning LEBM with DAMC." (A hedged sketch of the prior and posterior LD sampling is given after the table.)
Open Source Code | Yes | "Code and data available at https://github.com/yuPeiyu98/Diffusion-Amortized-MCMC."
Open Datasets | Yes | "We present a series of experiments on benchmark datasets including MNIST [42], SVHN [43], CelebA-64 [44], CIFAR-10 [45], CelebAMask-HQ [46], FFHQ [10] and LSUN-Tower [47]. ... Datasets: We include the following datasets to study our method: SVHN (32 x 32 x 3), CIFAR-10 (32 x 32 x 3), CelebA (64 x 64 x 3), CelebAMask-HQ (256 x 256 x 3) and MNIST (28 x 28 x 1). Following Pang et al. [22], we use the full training set of SVHN (73,257) and CIFAR-10 (50,000), and take 40,000 samples of CelebA as the training data. We take 29,500 samples from the CelebAMask-HQ dataset as the training data, and test the model on 500 held-out samples. For anomaly detection on MNIST dataset, we follow the experimental settings in [22, 41, 55, 56] and use 80% of the in-domain data to train the model."
Dataset Splits | Yes | "Following Pang et al. [22], we use the full training set of SVHN (73,257) and CIFAR-10 (50,000), and take 40,000 samples of CelebA as the training data. We take 29,500 samples from the CelebAMask-HQ dataset as the training data, and test the model on 500 held-out samples. For anomaly detection on MNIST dataset, we follow the experimental settings in [22, 41, 55, 56] and use 80% of the in-domain data to train the model." (A code sketch of these splits is given after the table.)
Hardware Specification | Yes | "We run the experiments on an A6000 GPU with a batch size of 128. For GAN inversion, we reduce the batch size to 64."
Software Dependencies | Yes | "The parameters of all the networks are initialized with the default PyTorch methods [77]. We use the Adam optimizer [78] with β1 = 0.5 and β2 = 0.999 to train the generator network and the energy score network. We use the AdamW optimizer [79] with β1 = 0.5, β2 = 0.999 and weight_decay=1e-4 to train the diffusion network." (A sketch of this optimizer setup is given after the table.)
Experiment Setup | Yes | "For the posterior and prior DAMC samplers, we set the number of diffusion steps to 100. The number of iterations in Eq. (8) is set to M = 6 for the experiments. The LD runs T = 30 and T = 60 iterations for posterior and prior updates during training with a step size of s = 0.1. For test-time sampling from K_{T, z_i|x_i} q_{φ_k}(z_i|x_i), we set T = 10 for the additional LD. For test-time prior sampling of LEBM with LD, we follow [22, 41] and set T = 100. To further stabilize the training procedure, we i) perform gradient clipping by setting the maximal gradient norm as 100, ii) use a separate target diffusion network, which is the EMA of the current diffusion network, to initialize the prior and posterior updates, and iii) add noise-initialized prior samples for the prior updates. These set-ups are identical across different datasets. ... The initial learning rates of the generator and diffusion networks are 2e-4, and 1e-4 for the energy score network. The learning rates are decayed with a factor of 0.99 every 1K training iterations, with a minimum learning rate of 1e-5." (A sketch of the stabilization tricks and learning-rate schedule is given after the table.)
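
The following is a minimal, hedged sketch of the prior and posterior Langevin dynamics (LD) sampling referred to in the Pseudocode row (Listing 1). It assumes the LEBM form p(z) ~ exp(E(z)) * N(z; 0, I) and a Gaussian observation model for the generator; the step-size convention, the noise scale sigma, and the function names are illustrative assumptions, not the authors' exact code.

import torch

def prior_ld_sample(E, z_init, n_steps=60, step_size=0.1):
    # Prior LD for p(z) ~ exp(E(z)) * N(z; 0, I):
    # z <- z + (s^2 / 2) * grad_z [E(z) - ||z||^2 / 2] + s * noise.
    z = z_init.detach().clone()
    for _ in range(n_steps):
        z.requires_grad_(True)
        log_prior = E(z).sum() - 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(log_prior, z)[0]
        z = z.detach() + 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()

def posterior_ld_sample(G, E, x, z_init, n_steps=30, step_size=0.1, sigma=0.3):
    # Posterior LD adds the reconstruction gradient of a Gaussian
    # observation model x ~ N(G(z), sigma^2 I); sigma is an assumption.
    z = z_init.detach().clone()
    for _ in range(n_steps):
        z.requires_grad_(True)
        log_lik = -((x - G(z)) ** 2).sum() / (2 * sigma ** 2)
        log_prior = E(z).sum() - 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(log_lik + log_prior, z)[0]
        z = z.detach() + 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()

The default step counts (60 prior, 30 posterior) and step size 0.1 follow the training settings quoted in the Experiment Setup row.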
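
To make the Dataset Splits row concrete, here is a hedged torchvision sketch of the splits described above. The CelebA crop/resize, the choice of which MNIST digit is treated as out-of-domain, and the use of the first N indices for subsetting are assumptions for illustration only.

import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

svhn_train = datasets.SVHN("./data", split="train", download=True,
                           transform=to_tensor)           # full set: 73,257
cifar_train = datasets.CIFAR10("./data", train=True, download=True,
                               transform=to_tensor)       # full set: 50,000

celeba = datasets.CelebA("./data", split="train", download=True,
                         transform=transforms.Compose(
                             [transforms.CenterCrop(140),   # crop size is an assumption
                              transforms.Resize(64), to_tensor]))
celeba_40k = torch.utils.data.Subset(celeba, range(40_000))  # 40,000 training samples

# MNIST anomaly detection: 80% of the in-domain data for training.
mnist = datasets.MNIST("./data", train=True, download=True, transform=to_tensor)
in_domain = (mnist.targets != 9).nonzero(as_tuple=True)[0]   # digit 9 as anomaly is illustrative
n_train = int(0.8 * len(in_domain))
mnist_train = torch.utils.data.Subset(mnist, in_domain[:n_train].tolist())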
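
A hedged sketch of the optimizer configuration quoted in the Software Dependencies row. The tiny placeholder modules only stand in for the real generator (G), energy score network (E), and diffusion network (Q); the learning rates are taken from the Experiment Setup row.

import torch
import torch.nn as nn

G = nn.Linear(100, 3 * 64 * 64)   # placeholder generator network
E = nn.Linear(100, 1)             # placeholder energy score network
Q = nn.Linear(100, 100)           # placeholder diffusion network

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_E = torch.optim.Adam(E.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_Q = torch.optim.AdamW(Q.parameters(), lr=2e-4, betas=(0.5, 0.999),
                          weight_decay=1e-4)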
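
Finally, a hedged sketch of the stabilization measures and learning-rate schedule quoted in the Experiment Setup row: gradient-norm clipping at 100, an EMA copy of the diffusion network, and learning-rate decay by a factor of 0.99 every 1K iterations with a floor of 1e-5. The placeholder network, the stand-in loss, and the EMA decay rate are assumptions.

import copy
import torch

Q = torch.nn.Linear(64, 64)                       # placeholder diffusion network
Q_ema = copy.deepcopy(Q)                          # EMA target network
opt_Q = torch.optim.AdamW(Q.parameters(), lr=2e-4, betas=(0.5, 0.999),
                          weight_decay=1e-4)

def ema_update(target, source, decay=0.999):      # decay value is an assumption
    with torch.no_grad():
        for p_t, p_s in zip(target.parameters(), source.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

for it in range(2_000):                           # illustrative training loop
    loss = Q(torch.randn(128, 64)).pow(2).mean()  # stand-in loss
    opt_Q.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(Q.parameters(), max_norm=100.0)
    opt_Q.step()
    ema_update(Q_ema, Q)

    if (it + 1) % 1000 == 0:                      # decay LR every 1K iterations
        for g in opt_Q.param_groups:
            g["lr"] = max(g["lr"] * 0.99, 1e-5)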