On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

Authors: Tariq Berrada Ifriqi, Pietro Astolfi, Melissa Hall, Reyhane Askari Hemmat, Yohann Benchetrit, Marton Havasi, Matthew Muckley, Karteek Alahari, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we perform an in-depth study of LDM training recipes focusing on the performance of models and their training efficiency.
Researcher Affiliation | Collaboration | (1) FAIR at Meta; (2) Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, France; (3) McGill University; (4) Mila, Quebec AI Institute; (5) Canada CIFAR AI Chair
Pseudocode | Yes | Pseudocode 1: Rectified grid sampling for positional embeddings.
Open Source Code | No | While we are not able to include code, we do provide specific and thorough detailing of our training and evaluation methods in Section 3.1 to enable experimental reproduction.
Open Datasets | Yes | To train class-conditional models, we use ImageNet-1k [11], which has 1.3M images spanning 1,000 classes, as well as ImageNet-22k [43], which contains 14.2M images spanning 21,841 classes. Additionally, we train text-to-image models using Conceptual 12M (CC12M) [6], which contains 12M images with accompanying manually generated textual descriptions.
Dataset Splits | Yes | The FID is reported w.r.t. the validation set of ImageNet.
Hardware Specification | Yes | When training at 256×256 resolution, we use a batch size of 2,048 images and a constant learning rate of 1.0×10⁻⁴, and train our models on two machines with eight A100 GPUs each. In preliminary experiments with the DiT architecture we found that the FID metric on ImageNet-1k at 256 resolution consistently improved with larger batches and learning rate, but that increasing the learning rate by another factor of two led to diverging runs. We report these results in the supplementary material. When training models at 512×512 resolution, we use the same approach but with a batch size of 384 distributed over 16 A100 GPUs.
Software Dependencies | No | The paper mentions "PyTorch" and "memory-efficient attention from PyTorch" (Appendix C), but does not provide specific version numbers for PyTorch or other libraries.
Experiment Setup | Yes | To train our models we use the Adam [28] optimizer, with a learning rate of 10⁻⁴ and (β₁, β₂) = (0.9, 0.999). When training at 256×256 resolution, we use a batch size of 2,048 images... Specifically, we use a quadratic beta schedule with βstart = 0.00085 and βend = 0.012.
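The quoted experiment setup maps onto standard PyTorch primitives. Below is a minimal sketch (not the authors' released code) of the quadratic beta schedule, read in the usual "scaled-linear" sense of a schedule that is linear in √β and then squared, with βstart = 0.00085 and βend = 0.012, together with the quoted Adam settings; the number of diffusion steps T = 1000 and the placeholder model are assumptions not stated in the excerpt.

```python
# Sketch only, assuming T = 1000 and a placeholder denoiser; not the paper's code.
import torch

def quadratic_beta_schedule(T: int = 1000,
                            beta_start: float = 0.00085,
                            beta_end: float = 0.012) -> torch.Tensor:
    """Quadratic ("scaled-linear") schedule: linear in sqrt(beta), then squared."""
    return torch.linspace(beta_start ** 0.5, beta_end ** 0.5, T) ** 2

betas = quadratic_beta_schedule()
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t, used in q(x_t | x_0)

# Optimizer settings quoted above: Adam, lr 1e-4, (beta1, beta2) = (0.9, 0.999).
# The paper trains with a global batch size of 2,048 at 256x256 on 16 A100 GPUs;
# the model and dataloader here are placeholders.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```

For the pseudocode row, the paper's Pseudocode 1 ("Rectified grid sampling for positional embeddings") is not reproduced in this summary. As a rough, hypothetical illustration of the general idea only, the sketch below samples 2D sin-cos positional embeddings from a coordinate grid rescaled to a fixed reference extent so that non-square token grids keep their aspect ratio; the function names, the base_size value, and the exact rescaling rule are assumptions, not the paper's algorithm.

```python
# Illustrative guess at aspect-preserving ("rectified") grid sampling of
# sin-cos positional embeddings; all names and constants are assumptions.
import torch

def sincos_embed_1d(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard 1D sin-cos embedding for a vector of positions."""
    omega = torch.arange(dim // 2, dtype=torch.float32) / (dim // 2)
    omega = 1.0 / (10000 ** omega)                       # (dim/2,)
    angles = pos[:, None] * omega[None, :]               # (N, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)  # (N, dim)

def grid_pos_embed(h: int, w: int, dim: int, base_size: int = 16) -> torch.Tensor:
    """Sample embeddings on an h x w grid rescaled to a fixed reference extent,
    preserving the grid's aspect ratio (assumed reading of "rectified")."""
    scale = base_size / max(h, w)                        # longer side spans the base extent
    ys = torch.arange(h, dtype=torch.float32) * scale
    xs = torch.arange(w, dtype=torch.float32) * scale
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")       # (h, w) coordinate grids
    emb_y = sincos_embed_1d(gy.reshape(-1), dim // 2)
    emb_x = sincos_embed_1d(gx.reshape(-1), dim // 2)
    return torch.cat([emb_y, emb_x], dim=1)              # (h * w, dim)

pos_embed = grid_pos_embed(h=16, w=32, dim=768)          # e.g. a 2:1 latent token grid
```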