Matryoshka Diffusion Models

Authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M. Susskind, Navdeep Jaitly

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024×1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
Researcher Affiliation | Industry | Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind & Navdeep Jaitly, Apple. {jgu32,szhai,yizzhang,jsusskind,njaitly}@apple.com
Pseudocode | Yes | Pseudocode for the Nested UNet, compared with the standard UNet, is presented in the paper's appendix. (An illustrative sketch of the nesting idea is given after this table.)
Open Source Code | Yes | Code and pre-trained checkpoints are released at https://github.com/apple/ml-mdm.
Open Datasets | Yes | For image generation, we performed class-conditioned generation on ImageNet (Deng et al., 2009) at 256×256, and performed general-purpose text-to-image generation using Conceptual 12M (CC12M, Changpinyo et al., 2021) at both 256×256 and 1024×1024 resolutions. As additional evidence of generality, we show results on text-to-video generation using WebVid-10M (Bain et al., 2021) at 16×256×256.
Dataset Splits | Yes | More specifically, we randomly sample 1/1000 of pairs as the validation set, where we monitor the CLIP and FID scores during training, and use the remaining data for training.
Hardware Specification | Yes | We use 8 A100 GPUs for ImageNet, and 32 A100 GPUs for CC12M and WebVid-10M, respectively.
Software Dependencies | No | No specific version numbers for software dependencies (e.g., PyTorch, TensorFlow) or libraries were provided beyond general optimizer names or model architectures.
Experiment Setup | Yes | For all experiments, we share the following training parameters; only the batch size and number of training steps differ across experiments. Default training config: optimizer=adam, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-8, learning_rate=1e-4, learning_rate_warmup_steps=30_000, weight_decay=0.0, gradient_clip_norm=2.0, ema_decay=0.9999, mixed_precision_training=bf16. (A schematic mapping of these settings onto a standard training loop is given after this table.)
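
The Nested UNet pseudocode referenced above appears in the paper's appendix and in the released repository. Purely as an illustration of the nesting idea, and not the authors' implementation, the PyTorch sketch below shows one way a nested model can consume a list of noisy inputs at decreasing resolutions and merge the inner (low-resolution) prediction back into the outer path; the class names, merge rule, and time conditioning are simplified assumptions.

import torch
import torch.nn.functional as F
from torch import nn

class NestedUNet(nn.Module):
    # Schematic Matryoshka-style nesting: an optional inner model handles the
    # lower-resolution inputs, and its prediction is upsampled and added to the
    # outer model's features before decoding the higher-resolution output.
    def __init__(self, channels, inner=None):
        super().__init__()
        self.enc = nn.Conv2d(channels, channels, 3, padding=1)
        self.dec = nn.Conv2d(channels, channels, 3, padding=1)
        self.inner = inner  # another NestedUNet, or None at the innermost level

    def forward(self, x_list, t_emb):
        # x_list: noisy inputs ordered from high to low resolution
        x, rest = x_list[0], x_list[1:]
        h = torch.relu(self.enc(x)) + t_emb
        inner_outs = []
        if self.inner is not None and rest:
            inner_outs = self.inner(rest, t_emb)
            # Merge the inner (low-resolution) prediction into the outer path.
            h = h + F.interpolate(inner_outs[0], size=h.shape[-2:], mode="nearest")
        return [self.dec(h)] + inner_outs  # one denoising prediction per resolution

# Example: two nested resolutions (64x64 outer, 32x32 inner).
model = NestedUNet(3, inner=NestedUNet(3))
x_hi, x_lo = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 32, 32)
t_emb = torch.zeros(2, 3, 1, 1)  # stand-in for a broadcastable time embedding
preds = model([x_hi, x_lo], t_emb)  # [64x64 prediction, 32x32 prediction]

A standard UNet corresponds to the degenerate case with inner=None and a single input resolution.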
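
The training hyperparameters quoted in the last row map directly onto standard PyTorch components. The sketch below is schematic, not the authors' training code (which is in the linked repository): it assumes "adam" denotes torch.optim.Adam with the listed betas, epsilon, and weight decay, a linear learning-rate warmup over 30,000 steps, gradient clipping at norm 2.0, an EMA of the weights with decay 0.9999, and bf16 autocast for mixed precision.

import torch

def build_optimizer_and_schedule(model, base_lr=1e-4, warmup_steps=30_000):
    # optimizer=adam, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-8, weight_decay=0.0
    opt = torch.optim.Adam(model.parameters(), lr=base_lr,
                           betas=(0.9, 0.99), eps=1e-8, weight_decay=0.0)
    # learning_rate_warmup_steps=30_000: linear warmup from 0 to base_lr
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: min(1.0, (step + 1) / warmup_steps))
    return opt, sched

def training_step(model, ema_model, loss, opt, sched, ema_decay=0.9999):
    # The loss itself would be computed under
    # torch.autocast(device_type="cuda", dtype=torch.bfloat16) for bf16 training.
    opt.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)  # gradient_clip_norm=2.0
    opt.step()
    sched.step()
    with torch.no_grad():
        # ema_decay=0.9999: ema = decay * ema + (1 - decay) * current
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.lerp_(p, 1.0 - ema_decay)
    return loss.item()

Batch size and the total number of training steps are the per-experiment settings noted in the quote.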