Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

Authors: Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua Susskind

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on both class- and text-conditioned image generation benchmarks. We show that Kaleido not only outperforms standard diffusion models in terms of diversity but also maintains the high quality of the generated images. Additionally, the generated latents effectively control the characteristics of the generated images, ensuring that the image samples closely align with the intended latent variables. This modeling of latent tokens not only increases the diversity of image outputs but also provides a degree of interpretability and control over the image generation process. (Section 4, Experiments)
Researcher Affiliation | Collaboration | Apple; University of Illinois Urbana-Champaign (equal contribution). {jgu32, szhai, yizzhang, njaitly, jsusskind}@apple.com; ying22@illinois.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Code will be released after acceptance and internal review.
Open Datasets | Yes | For the former, we use ImageNet [Deng et al., 2009], and we learn the text-to-image models on CC12M [Changpinyo et al., 2021].
Dataset Splits | No | The paper does not explicitly specify the training, validation, and test dataset splits needed to reproduce the experiments. It mentions evaluating metrics with '50K samples against the full training set' and using '10K samples' for diversity assessment, but does not detail how the data was split for model training and validation.
Hardware Specification | Yes | All experiments are performed on 64 A100 GPUs, which takes roughly 2 weeks of training for 400k steps on both the ImageNet and CC12M datasets.
Software Dependencies | No | The paper mentions using specific models like T5-XL and Qwen-VL-Chat, and frameworks like DDPM, but it does not specify the version numbers of general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Default training config: batch_size=512, num_updates=400_000, optimizer=adam, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-8, learning_rate=1e-4, learning_rate_warmup_steps=10_000, weight_decay=0.0, gradient_clip_norm=2.0, ema_decay=0.9999, mixed_precision_training=bf16
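For reference, the quoted default training config can be transcribed into a self-contained sketch. This is an illustrative reconstruction only: the `TrainConfig` dataclass and the `lr_at_step` helper are hypothetical names, not the authors' code, and the linear-warmup-then-constant schedule is a common assumption — the paper's quoted config states only the warmup length.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Values transcribed from the paper's quoted default training config;
    # the dataclass container itself is hypothetical, not the authors' code.
    batch_size: int = 512
    num_updates: int = 400_000
    optimizer: str = "adam"
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    adam_eps: float = 1e-8
    learning_rate: float = 1e-4
    learning_rate_warmup_steps: int = 10_000
    weight_decay: float = 0.0
    gradient_clip_norm: float = 2.0
    ema_decay: float = 0.9999
    mixed_precision_training: str = "bf16"


def lr_at_step(cfg: TrainConfig, step: int) -> float:
    """Linear warmup to the base learning rate, then constant.

    A common convention assumed here; the config specifies only
    learning_rate_warmup_steps=10_000, not the schedule shape.
    """
    if step < cfg.learning_rate_warmup_steps:
        return cfg.learning_rate * step / cfg.learning_rate_warmup_steps
    return cfg.learning_rate
```

Under this assumed schedule, the learning rate ramps linearly from 0 to 1e-4 over the first 10k updates and stays at 1e-4 for the remaining 390k.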