Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood

Authors: Yaxuan Zhu, Jianwen Xie, Ying Nian Wu, Ruiqi Gao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Combining these advances, our approach significantly boosts the generation performance compared to existing EBM methods on the CIFAR-10 and ImageNet datasets. We also demonstrate the effectiveness of our models for several downstream tasks, including classifier-free guided generation, compositional generation, image inpainting and out-of-distribution detection. We evaluate the performance of our model across various scenarios. Specifically, Section 4.1 demonstrates the capacity of unconditional generation. Section 4.2 highlights the potential of our model to further optimize sampling efficiency. The focus shifts to conditional generation and classifier-free guidance in Section 4.3. Section 4.4 elucidates the power of our model in performing likelihood estimation and OOD detection, and Section 4.5 showcases compositional generation. Please refer to Appendix B for implementation details, Appendix C.1 for image inpainting with our trained models, Appendix D.3 for comparing the sampling time between our approach and other EBM models, Appendix D.4 for understanding the role of the EBM and the initializer in the generation process, and Appendix D for the ablation study.
Researcher Affiliation | Collaboration | Yaxuan Zhu (UCLA, yaxuanzhu@g.ucla.edu); Jianwen Xie (Akool Research, jianwen@ucla.edu); Ying Nian Wu (UCLA, ywu@stat.ucla.edu); Ruiqi Gao (Google DeepMind, ruiqig@google.com)
Pseudocode | Yes | Algorithm 1: CDRL Training
Input: (1) observed data x_0 ∼ p_data(x); (2) number of noise levels T; (3) number of Langevin sampling steps K per noise level; (4) Langevin step size s_t at each noise level; (5) learning rate η_θ for EBM f_θ; (6) learning rate η_φ for initializer g_φ.
Output: parameters θ, φ.
Randomly initialize θ and φ.
repeat
  Sample noise level t from {0, 1, ..., T−1}.
  Sample ε ∼ N(0, I).
  Let x_{t+1} = α_{t+1} x_0 + σ_{t+1} ε and y_t = α_{t+1}(α_t x_0 + σ_t ε).
  Generate the initial sample ŷ_t following Equation 6.
  Generate the refined sample ỹ_t by running K steps of Langevin dynamics starting from ŷ_t, following Equation 5.
  Update the EBM parameters θ following the gradients in Equation 7.
  Update the initializer parameters φ by maximizing Equation 8.
until converged
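The algorithm above maps onto a standard alternating update loop. Below is a minimal PyTorch-style sketch of that loop, assuming placeholder `ebm` and `initializer` networks, a precomputed noise schedule (`alphas`, `sigmas`), per-level Langevin step sizes, and simplified stand-ins for the paper's Equations 5-8; it is an illustrative reading of Algorithm 1, not the authors' implementation (no official code is linked).

```python
# Illustrative sketch of Algorithm 1 (CDRL training); not the authors' code.
# `ebm(y, x_next, t)` is assumed to return the unnormalized log-density f_theta,
# and `initializer(x_next, t)` the initial guess y_hat; both are placeholders.
from itertools import cycle
import torch

def train_cdrl(ebm, initializer, data_loader, alphas, sigmas, step_sizes,
               T=6, K=15, lr_ebm=1e-4, lr_init=1e-5, n_iters=400_000):
    opt_ebm = torch.optim.Adam(ebm.parameters(), lr=lr_ebm,
                               betas=(0.9, 0.999), weight_decay=0.0)
    opt_init = torch.optim.Adam(initializer.parameters(), lr=lr_init,
                                betas=(0.9, 0.999), weight_decay=0.0)
    data = cycle(data_loader)
    for _ in range(n_iters):
        x0 = next(data)                              # observed data x_0 ~ p_data
        t = torch.randint(0, T, (1,)).item()         # noise level t in {0, ..., T-1}
        eps = torch.randn_like(x0)
        x_next = alphas[t + 1] * x0 + sigmas[t + 1] * eps          # noisy sample at level t+1
        y_t = alphas[t + 1] * (alphas[t] * x0 + sigmas[t] * eps)   # target at level t

        y = initializer(x_next, t).detach()          # initial sample y_hat (Eq. 6)
        for _ in range(K):                           # K Langevin refinement steps (Eq. 5)
            y = y.clone().requires_grad_(True)
            grad = torch.autograd.grad(ebm(y, x_next, t).sum(), y)[0]
            y = (y + 0.5 * step_sizes[t] ** 2 * grad
                 + step_sizes[t] * torch.randn_like(y)).detach()

        # EBM update (stand-in for Eq. 7): raise f_theta on observed pairs,
        # lower it on refined samples.
        loss_ebm = ebm(y, x_next, t).mean() - ebm(y_t, x_next, t).mean()
        opt_ebm.zero_grad()
        loss_ebm.backward()
        opt_ebm.step()

        # Initializer update (simplified surrogate for Eq. 8): regress the
        # initializer's output toward the refined samples.
        loss_init = ((initializer(x_next, t) - y) ** 2).mean()
        opt_init.zero_grad()
        loss_init.backward()
        opt_init.step()
```

The learning-rate warm-up, EMA, and exact loss definitions quoted in the Experiment Setup row below are omitted from this sketch for brevity.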
Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the methodology.
Open Datasets | Yes | Our experiments primarily involve three datasets: (i) CIFAR-10 (Krizhevsky & Hinton, 2009) comprises images from 10 categories, with 50k training samples and 10k test samples at a resolution of 32×32 pixels. We use its training set for evaluating our model in the task of unconditional generation. (ii) ImageNet (Deng et al., 2009) contains approximately 1.28M images from 1000 categories. We use its training set for both conditional and unconditional generation, focusing on a downsampled version (32×32) of the dataset. (iii) CelebA (Liu et al., 2015) consists of around 200k human face images, each annotated with attributes. We downsample each image of the dataset to the size of 64×64 pixels and utilize the resized dataset for compositionality and image inpainting tasks.
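For concreteness, the CIFAR-10 and CelebA preparation described above can be reproduced with standard torchvision loaders; the root path and transform choices below are assumptions for illustration, not details taken from the paper.

```python
# Sketch of the dataset preparation quoted above (assumed torchvision pipeline).
from torchvision import datasets, transforms

# CIFAR-10: 50k training images at 32x32, used for unconditional generation.
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor())

# CelebA: ~200k face images, resized to 64x64 for compositionality and inpainting.
celeba_train = datasets.CelebA(root="./data", split="train", download=True,
                               transform=transforms.Compose([
                                   transforms.Resize(64),
                                   transforms.CenterCrop(64),
                                   transforms.ToTensor(),
                               ]))

# Downsampled ImageNet (32x32) is distributed as a separate archive and is not
# bundled with torchvision, so it is omitted from this sketch.
```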
Dataset Splits | No | For CIFAR-10, the paper mentions '50k training samples and 10k test samples' but does not specify a validation set split. The other datasets also lack explicit validation split information.
Hardware Specification | Yes | Training is conducted across 8 Nvidia A100 GPUs, typically requiring approximately 400k iterations, which spans approximately 6 days. We conduct the sampling process of each model individually on a single A6000 GPU to generate a batch of 100 samples on the CIFAR-10 dataset.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify versions for any key software components, such as programming languages, frameworks, or libraries.
Experiment Setup | Yes | We set the learning rate of the EBM to be η_θ = 1e−4 and the learning rate of the initializer to be η_φ = 1e−5. We use linear warm-up for both the EBM and the initializer, and let the initializer start earlier than the EBM. More specifically, given training iteration iter, we have: η_θ = min(1.0, iter / 10000) × 1e−4 and η_φ = min(1.0, (iter + 500) / 10000) × 1e−5. We use the Adam optimizer (Kingma & Ba, 2015; Loshchilov & Hutter, 2019) to train both the EBM and the initializer, with β = (0.9, 0.999) and a weight decay equal to 0.0. We also apply an exponential moving average with a decay rate of 0.9999 to both the EBM and the initializer. Training is conducted across 8 Nvidia A100 GPUs, typically requiring approximately 400k iterations, which spans approximately 6 days. Following Gao et al. (2021), we use a re-parameterization trick to calculate the energy term. Our EBM is constructed across noise levels t = 0, 1, 2, 3, 4, 5 and we assume the distribution at noise level t = 6 is a simple Normal distribution during sampling. ... We use 15 steps of Langevin updates at each noise level, with the Langevin step size at noise level t given by s_t² = 0.054 · σ_t · σ²_{t+1} (Equation 16), where σ²_{t+1} is the variance of the added noise at noise level t+1 and σ_t is the standard deviation of the accumulative noise at noise level t.
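Read literally, the warm-up rule and Equation 16 quoted above translate into a few lines of schedule code. The sketch below assumes per-level arrays `sigmas` (accumulative noise standard deviations) and `sigmas_added` (added-noise standard deviations) that the paper defines through its noise schedule; it is a hypothetical reading of the quoted formulas, not the authors' code.

```python
# Hypothetical helpers for the learning-rate warm-up and Langevin step size.

def ebm_lr(iteration: int) -> float:
    # Linear warm-up over the first 10k iterations to a peak of 1e-4.
    return min(1.0, iteration / 10000) * 1e-4

def initializer_lr(iteration: int) -> float:
    # Same warm-up, shifted 500 iterations earlier, peaking at 1e-5.
    return min(1.0, (iteration + 500) / 10000) * 1e-5

def langevin_step_size(t: int, sigmas, sigmas_added) -> float:
    # Equation 16: s_t^2 = 0.054 * sigma_t * sigma_{t+1}^2, so s_t is its square root.
    return (0.054 * sigmas[t] * sigmas_added[t + 1] ** 2) ** 0.5
```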