Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Authors: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, Enrico Shippole

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that HDiT performs competitively with existing models on ImageNet 256², and sets a new state-of-the-art for diffusion models on FFHQ-1024². Code is available at github.com/crowsonkb/k-diffusion.
Researcher Affiliation | Collaboration | ¹Stability AI, United States; ²CompVis @ LMU Munich, Germany; ³Birchlabs, England, United Kingdom; ⁴realiz.ai, New York, United States; ⁵Independent Researcher, Florida, United States. Correspondence to: Katherine Crowson <crowsonkb@gmail.com>, Stefan Baumann <stefan.baumann@lmu.de>, Alex Birch <alex@birchlabs.co.uk>.
Pseudocode | No | The paper describes architectures and processes but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at github.com/crowsonkb/k-diffusion.
Open Datasets | Yes | We evaluate the proposed HDiT architecture on conditional and unconditional image generation, ablating over architectural choices (Section 5.2), and evaluating both megapixel pixel-space image generation (Section 5.3) and large-scale pixel-space image generation (Section 5.4). Training: Unless mentioned otherwise, we train class-conditional models on ImageNet (Deng et al., 2009) at a resolution of 128×128 directly on RGB pixels without any kind of latent representation.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It mentions using 50k samples for FID computation, which typically refers to an evaluation set, but does not specify how the data was split for training and validation (see the FID sketch after the setup table below).
Hardware Specification | Yes | Training Hardware: 4× A100 80GiB, 64× A100 80GiB, and 8× H100 80GiB for the three experiment configurations, respectively (see Table 7).
Software Dependencies | No | The paper mentions specific optimizers and samplers (e.g., AdamW, DPM++(3M) SDE sampling) but does not provide specific version numbers for software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | For further details, see Table 7 (reconstructed below):

Parameter | ImageNet-128² | FFHQ-1024² | ImageNet-256²
Experiment | Ablation E4 (Section 5.2) | High-Res Synthesis (Section 5.3) | Large-Scale (Section 5.4)
Parameters | 117M | 85M | 557M
GFLOP/forward | 31 | 206 | 198
Training Steps | 400k | 1M | 2.2M
Batch Size | 256 | 256 | 256
Precision | bfloat16 | bfloat16 | bfloat16
Training Hardware | 4× A100 80GiB | 64× A100 80GiB | 8× H100 80GiB
Training Time | 15 hours | 5 days | 7.6 days
Patch Size | 4 | 4 | 4
Levels (Local + Global Attention) | 1 + 1 | 3 + 2 | 2 + 1
Depth | [2, 11] | [2, 2, 2, 2, 2] | [2, 2, 16]
Widths | [384, 768] | [128, 256, 384, 768, 1024] | [384, 768, 1536]
Attention Heads (Width / Head Dim) | [6, 12] | [2, 4, 6, 12, 16] | [6, 12, 24]
Attention Head Dim | 64 | 64 | 64
Neighborhood Kernel Size | 7 | 7 | 7
Mapping Depth | 1 | 2 | 2
Mapping Width | 768 | 768 | 768
Data Sigma | 0.5 | 0.5 | 0.5
Sigma Range | [1e-3, 1e3] | [1e-3, 1e3] | [1e-3, 1e3]
Sigma Sampling Density | interpolated cosine | interpolated cosine | interpolated cosine
Augmentation Probability | 0 | 0.12 | 0
Dropout Rate | 0 | [0, 0, 0, 0, 0.1] | 0
Conditioning Dropout Rate | 0.1 | 0.1 | 0.1
Optimizer | AdamW | AdamW | AdamW
Learning Rate | 5e-4 | 5e-4 | 5e-4
Betas | [0.9, 0.95] | [0.9, 0.95] | [0.9, 0.95]
Eps | 1e-8 | 1e-8 | 1e-8
Weight Decay | 1e-2 | 1e-2 | 1e-2
EMA Decay | 0.9999 | 0.9999 | 0.9999
Sampler | DPM++(3M) SDE | DPM++(3M) SDE | DPM++(3M) SDE
Sampling Steps | 50 | 50 | 50
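To make the optimizer and EMA rows above concrete, here is a minimal PyTorch sketch using the AdamW hyperparameters and EMA decay reported in Table 7. The placeholder module and the `ema_update` helper are illustrative assumptions; the paper's actual training loop lives in the k-diffusion repository and may differ.

```python
import torch

# Placeholder module standing in for the HDiT model (assumption for illustration).
model = torch.nn.Linear(16, 16)

# AdamW exactly as reported in Table 7.
opt = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1e-2,
)

# EMA copy of the weights, updated with decay 0.9999 after each optimizer step.
ema_model = torch.nn.Linear(16, 16)
ema_model.load_state_dict(model.state_dict())

@torch.no_grad()
def ema_update(ema, online, decay=0.9999):
    # Exponential moving average: p_ema <- decay * p_ema + (1 - decay) * p
    for p_ema, p in zip(ema.parameters(), online.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)
```

Calling `ema_update(ema_model, model)` after each step maintains the averaged weights; it is standard practice (though not stated explicitly in the table) to sample from the EMA weights.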
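Similarly, a hedged sketch of the sampling configuration (DPM++(3M) SDE, 50 steps, sigma range [1e-3, 1e3]) using the k-diffusion package from the paper's repository. The identity `denoiser` is a placeholder, and the Karras noise schedule is an assumption; the paper specifies only the sampler and step count, not the sampling-time sigma schedule.

```python
import torch
from k_diffusion import sampling  # github.com/crowsonkb/k-diffusion

def denoiser(x, sigma):
    """Hypothetical denoiser wrapper: maps (noisy x, sigma) -> denoised x."""
    return x  # placeholder; the trained HDiT model goes here

# 50 steps over the sigma range [1e-3, 1e3] from Table 7. The Karras
# schedule here is an assumption for illustration.
sigmas = sampling.get_sigmas_karras(n=50, sigma_min=1e-3, sigma_max=1e3)

x = torch.randn(1, 3, 256, 256) * sigmas[0]  # start from pure noise at sigma_max
samples = sampling.sample_dpmpp_3m_sde(denoiser, x, sigmas)
```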
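Finally, since the Dataset Splits entry notes that 50k samples are used for FID, here is a minimal sketch of the underlying Fréchet distance between Gaussians fitted to feature statistics. This is the standard FID definition, not code from the paper; feature extraction (e.g., InceptionV3 pool features over 50k generated and reference images) is omitted.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}).

    mu/cov are the mean and covariance of feature vectors, computed over
    (typically) 50k images per side for FID.
    """
    diff = mu1 - mu2
    covmean = linalg.sqrtm(cov1 @ cov2)  # matrix square root of the product
    covmean = covmean.real               # drop tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```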