Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
Authors: Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua Susskind
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on both class- and text-conditioned image generation benchmarks. We show that Kaleido not only outperforms standard diffusion models in terms of diversity but also maintains the high quality of the generated images. Additionally, the generated latents effectively control the characteristics of the generated images, ensuring that the image samples closely align with the intended latent variables. This modeling of latent tokens not only increases the diversity of image outputs but also provides a degree of interpretability and control over the image generation process. |
| Researcher Affiliation | Collaboration | Apple; University of Illinois Urbana-Champaign (equal contribution) — {jgu32, szhai, yizzhang, njaitly, jsusskind}@apple.com; ying22@illinois.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be released after acceptance and internal review. |
| Open Datasets | Yes | For the former, we use ImageNet [Deng et al., 2009], and we learn the text-to-image models on CC12M [Changpinyo et al., 2021] |
| Dataset Splits | No | The paper does not explicitly specify training, validation, and test dataset splits needed to reproduce the experiment. It mentions evaluating metrics with '50K samples against the full training set' and using '10K samples' for diversity assessment, but does not detail how the data was split for model training and validation. |
| Hardware Specification | Yes | All experiments are performed on 64 A100 GPUs, which takes roughly 2 weeks for training 400k steps for both the ImageNet and CC12M datasets. |
| Software Dependencies | No | The paper mentions using specific models like T5-XL and Qwen-VL-Chat, and frameworks like DDPM, but it does not specify the version numbers of general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | default training config: batch_size=512, num_updates=400_000, optimizer=adam, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-8, learning_rate=1e-4, learning_rate_warmup_steps=10_000, weight_decay=0.0, gradient_clip_norm=2.0, ema_decay=0.9999, mixed_precision_training=bf16 |
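The quoted training configuration can be sketched in code. The sketch below is an assumption-laden illustration, not the authors' released code: it collects the hyperparameters from the table into a dict and implements three conventions they imply — linear LR warmup, global-norm gradient clipping, and EMA of parameters. The function names (`lr_at_step`, `clip_grad_norm`, `ema_update`) are hypothetical.

```python
import math

# Default training config as quoted in the paper (values from the table above).
config = {
    "batch_size": 512,
    "num_updates": 400_000,
    "optimizer": "adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.99,
    "adam_eps": 1e-8,
    "learning_rate": 1e-4,
    "learning_rate_warmup_steps": 10_000,
    "weight_decay": 0.0,
    "gradient_clip_norm": 2.0,
    "ema_decay": 0.9999,
    "mixed_precision_training": "bf16",
}

def lr_at_step(step: int) -> float:
    """Linear warmup to the base LR over the warmup steps (a common
    convention; the paper does not specify the warmup shape)."""
    warmup = config["learning_rate_warmup_steps"]
    return config["learning_rate"] * min(1.0, step / warmup)

def clip_grad_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale a flat gradient vector so its L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

def ema_update(ema: list[float], new: list[float], decay: float) -> list[float]:
    """Exponential moving average of parameters: ema <- decay*ema + (1-decay)*new."""
    return [decay * e + (1 - decay) * n for e, n in zip(ema, new)]
```

For example, at step 5,000 the warmup LR would be half the base rate (5e-5), and a gradient vector of norm 5.0 would be rescaled to the configured clip norm of 2.0.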