Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Authors: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, Enrico Shippole
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that HDiT performs competitively with existing models on ImageNet-256², and sets a new state-of-the-art for diffusion models on FFHQ-1024². Code is available at github.com/crowsonkb/k-diffusion. |
| Researcher Affiliation | Collaboration | ¹Stability AI, United States; ²CompVis @ LMU Munich, Germany; ³Birchlabs, England, United Kingdom; ⁴realiz.ai, New York, United States; ⁵Independent Researcher, Florida, United States. Correspondence to: Katherine Crowson <crowsonkb@gmail.com>, Stefan Baumann <stefan.baumann@lmu.de>, Alex Birch <alex@birchlabs.co.uk>. |
| Pseudocode | No | The paper describes architectures and processes but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/crowsonkb/k-diffusion. |
| Open Datasets | Yes | We evaluate the proposed HDiT architecture on conditional and unconditional image generation, ablating over architectural choices (Section 5.2), and evaluating both megapixel pixel-space image generation (Section 5.3) and large-scale pixel-space image generation (Section 5.4). Training: Unless mentioned otherwise, we train class-conditional models on ImageNet (Deng et al., 2009) at a resolution of 128 × 128 directly on RGB pixels without any kind of latent representation. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It mentions using 50k samples for FID computation, which typically refers to an evaluation set, but doesn't specify how the data was split for training/validation. |
| Hardware Specification | Yes | Training Hardware: 4 A100 80GiB; 64 A100 80GiB; 8 H100 80GiB |
| Software Dependencies | No | The paper mentions specific optimizers and samplers (e.g., AdamW, DPM++(3M) SDE sampling) but does not provide specific version numbers for software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For further details, see Table 7. Values are listed in the column order ImageNet-128² / FFHQ-1024² / ImageNet-256². Experiment: Ablation E4 (Section 5.2) / High-Res Synthesis (Section 5.3) / Large-Scale (Section 5.4); Parameters: 117M / 85M / 557M; GFLOP/forward: 31 / 206 / 198; Training Steps: 400k / 1M / 2.2M; Batch Size: 256 / 256 / 256+5; Precision: bfloat16; Training Hardware: 4 A100 80GiB / 64 A100 80GiB / 8 H100 80GiB; Training Time: 15 hours6 / 5 days6 / 7.6 days; Patch Size: 4 / 4 / 4; Levels (Local + Global Attention): 1 + 1 / 3 + 2 / 2 + 1; Depth: [2, 11] / [2, 2, 2, 2, 2] / [2, 2, 16]; Widths: [384, 768] / [128, 256, 384, 768, 1024] / [384, 768, 1536]; Attention Heads (Width / Head Dim): [6, 12] / [2, 4, 6, 12, 16] / [6, 12, 24]; Attention Head Dim: 64 / 64 / 64; Neighborhood Kernel Size: 7 / 7 / 7; Mapping Depth: 1 / 2 / 2; Mapping Width: 768 / 768 / 768; Data Sigma: 0.5 / 0.5 / 0.5; Sigma Range: [1e-3, 1e3] / [1e-3, 1e3] / [1e-3, 1e3]; Sigma Sampling Density: interpolated cosine / interpolated cosine / interpolated cosine; Augmentation Probability: 0 / 0.12 / 0; Dropout Rate: 0 / [0, 0, 0, 0, 0.1] / 0; Conditioning Dropout Rate: 0.1 / 0.1 / 0.1; Optimizer: AdamW / AdamW / AdamW; Learning Rate: 5e-4 / 5e-4 / 5e-4; Betas: [0.9, 0.95] / [0.9, 0.95] / [0.9, 0.95]; Eps: 1e-8 / 1e-8 / 1e-8; Weight Decay: 1e-2 / 1e-2 / 1e-2; EMA Decay: 0.9999 / 0.9999 / 0.9999; Sampler: DPM++(3M) SDE / DPM++(3M) SDE / DPM++(3M) SDE; Sampling Steps: 50 / 50 / 50 |
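
The Experiment Setup row labels the attention-head counts as "Width / Head Dim". The following is a minimal sketch that checks this arithmetic against the reported widths and head dimension. The widths, head counts, and head dimension are copied from the row above; the dictionary layout and the `heads_per_level` helper are hypothetical names introduced here for illustration and are not taken from the k-diffusion codebase.

```python
# Values copied from the Experiment Setup row (Table 7 of the paper).
HEAD_DIM = 64

WIDTHS = {
    "ImageNet-128^2": [384, 768],
    "FFHQ-1024^2": [128, 256, 384, 768, 1024],
    "ImageNet-256^2": [384, 768, 1536],
}

REPORTED_HEADS = {
    "ImageNet-128^2": [6, 12],
    "FFHQ-1024^2": [2, 4, 6, 12, 16],
    "ImageNet-256^2": [6, 12, 24],
}


def heads_per_level(widths, head_dim=HEAD_DIM):
    """Hypothetical helper: number of attention heads per level, width // head_dim."""
    heads = []
    for width in widths:
        if width % head_dim != 0:
            raise ValueError(f"width {width} is not divisible by head dim {head_dim}")
        heads.append(width // head_dim)
    return heads


for name, widths in WIDTHS.items():
    computed = heads_per_level(widths)
    status = "matches" if computed == REPORTED_HEADS[name] else "differs from"
    print(f"{name}: computed {computed} {status} reported {REPORTED_HEADS[name]}")
```

Under these assumptions, every width in the three configurations divides evenly by the head dimension of 64, and the computed head counts agree with the reported ones.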