Matryoshka Diffusion Models
Authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M. Susskind, Navdeep Jaitly
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024×1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images. |
| Researcher Affiliation | Industry | Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind & Navdeep Jaitly Apple {jgu32,szhai,yizzhang,jsusskind,njaitly}@apple.com |
| Pseudocode | Yes | Pseudocode for the Nested UNet, compared with a standard UNet, is presented in the paper. (A hedged sketch of the nesting idea appears after the table.) |
| Open Source Code | Yes | Code and pre-trained checkpoints are released at https://github.com/apple/ml-mdm. |
| Open Datasets | Yes | For image generation, we performed class-conditioned generation on ImageNet (Deng et al., 2009) at 256×256, and performed general purpose text-to-image generation using Conceptual 12M (CC12M, Changpinyo et al., 2021) at both 256×256 and 1024×1024 resolutions. As additional evidence of generality, we show results on text-to-video generation using WebVid-10M (Bain et al., 2021) at 16×256×256. |
| Dataset Splits | Yes | More specifically, we randomly sample 1/1000 of pairs as the validation set where we monitor the CLIP and FID scores during training, and use the remaining data for training. |
| Hardware Specification | Yes | We use 8 A100 GPUs for ImageNet, and 32 A100 GPUs for CC12M and WebVid-10M, respectively. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., PyTorch, TensorFlow) or libraries were provided beyond general optimizer names or model architectures. |
| Experiment Setup | Yes | For all experiments, we share the following training parameters; only the batch size and number of training steps differ across experiments. Default training config: optimizer=adam, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-8, learning_rate=1e-4, learning_rate_warmup_steps=30_000, weight_decay=0.0, gradient_clip_norm=2.0, ema_decay=0.9999, mixed_precision_training=bf16. (A hedged sketch mapping these values onto a PyTorch training step appears after the table.) |
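
Since the paper's pseudocode is not reproduced above, the following is a minimal PyTorch sketch of the nesting idea it describes: an inner UNet denoises the low-resolution input, and its prediction is injected into the outer UNet's middle stage, so a single forward pass yields outputs at every resolution. The class names, the restriction to two resolutions, and the omission of timestep/text conditioning are illustrative simplifications, not the released implementation.

```python
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Minimal single-scale UNet: one down stage, one middle stage, one up stage."""

    def __init__(self, channels: int = 64, in_channels: int = 3):
        super().__init__()
        self.conv_in = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(channels, channels, 3, padding=1)
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(channels, in_channels, 3, padding=1)

    def forward(self, x):
        h0 = self.conv_in(x)               # full-resolution features (skip)
        h1 = self.mid(self.down(h0))       # middle stage at half resolution
        h = torch.cat([self.up(h1), h0], dim=1)
        return self.conv_out(self.fuse(h))


class NestedUNet(nn.Module):
    """Outer UNet whose middle stage wraps an inner UNet acting on the low-res input."""

    def __init__(self, channels: int = 64, in_channels: int = 3):
        super().__init__()
        self.conv_in = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.inner = TinyUNet(channels, in_channels)       # denoises the low-res input
        self.inner_proj = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(channels, in_channels, 3, padding=1)

    def forward(self, x_list):
        x_hi, x_lo = x_list                # noisy inputs at two resolutions (H, H/2)
        h0 = self.conv_in(x_hi)
        h1 = self.down(h0)
        out_lo = self.inner(x_lo)          # inner UNet's low-resolution prediction
        h1 = h1 + self.inner_proj(out_lo)  # inject it into the outer middle stage
        h = torch.cat([self.up(h1), h0], dim=1)
        out_hi = self.conv_out(self.fuse(h))
        return [out_hi, out_lo]            # predictions at both resolutions


# Example: a 64x64 / 32x32 resolution pair.
x_hi, x_lo = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 32, 32)
out_hi, out_lo = NestedUNet()([x_hi, x_lo])
```

The released code at https://github.com/apple/ml-mdm nests recursively across more resolutions and fuses features rather than raw predictions; the sketch only shows the structural pattern the pseudocode refers to.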
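
The default training configuration in the last row can likewise be read as plain PyTorch. The sketch below is an assumption about how those values map onto standard Adam, linear warmup, gradient clipping, bf16 autocast, and EMA utilities; `diffusion_loss` is a placeholder stand-in, not a function from the released code, and the linear warmup shape is assumed rather than stated in the table.

```python
import torch

model = NestedUNet()  # the sketch above; any nn.Module works here

# optimizer=adam, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-8,
# learning_rate=1e-4, weight_decay=0.0
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.99), eps=1e-8, weight_decay=0.0
)

# learning_rate_warmup_steps=30_000 (linear warmup assumed)
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 30_000)
)

# ema_decay=0.9999
ema = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=torch.optim.swa_utils.get_ema_avg_fn(decay=0.9999)
)


def diffusion_loss(model, batch):
    """Placeholder objective; the actual diffusion loss lives in the released code."""
    out_hi, out_lo = model(batch)
    return out_hi.pow(2).mean() + out_lo.pow(2).mean()


def training_step(batch, device_type="cpu"):
    # mixed_precision_training=bf16
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        loss = diffusion_loss(model, batch)
    loss.backward()
    # gradient_clip_norm=2.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
    optimizer.step()
    optimizer.zero_grad()
    warmup.step()
    ema.update_parameters(model)
    return loss.detach()


loss = training_step([torch.randn(2, 3, 64, 64), torch.randn(2, 3, 32, 32)])
```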