One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
Authors: Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the experimental setup in Section 6.1. We show the ability of UniDiffuser to perform multiple generation tasks and directly compare it with existing large models in Section 6.2. We further demonstrate that UniDiffuser naturally supports applications like data variation, blocked Gibbs sampling between modalities (see Section 6.3), and interpolation between images in the wild (see Section 6.4). |
| Researcher Affiliation | Collaboration | 1 Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua-Huawei Joint Center for AI, BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University; 2 ShengShu, Beijing, China; 3 Gaoling School of AI, Renmin University of China; Beijing Key Lab of Big Data Management and Analysis Methods, Beijing, China; 4 Beijing Academy of Artificial Intelligence; 5 Pazhou Laboratory (Huangpu), Guangzhou, China. |
| Pseudocode | Yes | Algorithm 1 Training (a hedged sketch of this training objective follows the table) |
| Open Source Code | Yes | Our code is available at https://github.com/thu-ml/unidiffuser. |
| Open Datasets | Yes | We use three subsets of LAION-5B (Schuhmann et al., 2022) following Stable Diffusion (Rombach et al., 2022). |
| Dataset Splits | No | We reduce the learning rate by a factor of 10 and continue training whenever the validation loss does not decrease. |
| Hardware Specification | Yes | The training takes around 28 days on 88 A100 (80GB) GPUs. |
| Software Dependencies | No | We use DPM-Solver (Lu et al., 2022b;c) with 50 steps in all experiments. (A hedged sampling sketch follows the table.) |
| Experiment Setup | Yes | In the first stage, we train 250K steps at 256×256 resolution on laion2B-en with a batch size of 11264 and 5K warm-up steps. In the second stage, we fine-tune the model with 200K steps at 512×512 resolution on laion-high-resolution with a batch size of 2112 and 5K warm-up steps. In the last stage, we resume from the last checkpoint of the second stage (including both weights of the model and states of the optimizer), and train 220K steps at 512×512 resolution on laion-aesthetics v2 5+ with a batch size of 2112. Following Bao et al. (2023a), we use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 2e-4, a weight decay of 0.03 and running coefficients of (β1, β2) = (0.9, 0.9) in all stages. We reduce the learning rate by a factor of 10 and continue training whenever the validation loss does not decrease. We train with mixed precision for efficiency. When U-ViT is trained at 256×256 resolution, we interpolate the positional embeddings related to images via bilinear interpolation. The training takes around 28 days on 88 A100 (80GB) GPUs. We use DPM-Solver (Lu et al., 2022b;c) with 50 steps in all experiments. (A hedged training-configuration sketch follows the table.) |
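
The Pseudocode row points to Algorithm 1 (Training), which optimizes a single joint noise-prediction network with independent timesteps per modality. Below is a minimal PyTorch sketch of that objective under stated assumptions: the backbone interface `nnet(x_t, t_x, y_t, t_y)`, the linear beta schedule, and all tensor shapes are illustrative placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def linear_alphas_cumprod(n_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Standard linear beta schedule; the paper's exact schedule is an assumption here.
    betas = torch.linspace(beta_start, beta_end, n_steps)
    return torch.cumprod(1.0 - betas, dim=0)

def perturb(z0, t, eps, alphas_cumprod):
    # q(z_t | z_0): sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    a = alphas_cumprod.to(z0.device)[t].view(-1, *([1] * (z0.dim() - 1)))
    return a.sqrt() * z0 + (1.0 - a).sqrt() * eps

def unidiffuser_loss(nnet, x0, y0, alphas_cumprod):
    """Joint noise-prediction loss (cf. Algorithm 1, Training).

    x0: clean image latents, y0: clean text embeddings (batch in dim 0).
    `nnet` stands in for the joint U-ViT backbone; its interface is assumed.
    """
    B = x0.shape[0]
    # Independent timesteps per modality: keeping t_y = 0 during sampling recovers
    # text-to-image, t_x = 0 recovers image-to-text, equal timesteps give joint generation.
    t_x = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    t_y = torch.randint(0, len(alphas_cumprod), (B,), device=y0.device)
    eps_x, eps_y = torch.randn_like(x0), torch.randn_like(y0)
    x_t = perturb(x0, t_x, eps_x, alphas_cumprod)
    y_t = perturb(y0, t_y, eps_y, alphas_cumprod)
    eps_x_hat, eps_y_hat = nnet(x_t, t_x, y_t, t_y)
    return F.mse_loss(eps_x_hat, eps_x) + F.mse_loss(eps_y_hat, eps_y)
```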
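
The Software Dependencies row names only DPM-Solver with 50 steps. The sketch below shows what such a sampling call could look like via the public `dpm_solver_pytorch` module (`NoiseScheduleVP`, `model_wrapper`, `DPM_Solver`); everything beyond `steps=50`, including the wrapped `model` and `betas` objects and the solver options, is an assumption to check against the installed version.

```python
import torch
# Interface names follow the README of https://github.com/LuChengTHU/dpm-solver;
# verify them against the version actually installed.
from dpm_solver_pytorch import NoiseScheduleVP, model_wrapper, DPM_Solver

def sample_with_dpm_solver(model, betas, x_T, steps=50):
    """Draw a sample with 50 solver steps, as reported in the setup; the order,
    multistep method, and time-uniform skipping are illustrative defaults."""
    noise_schedule = NoiseScheduleVP(schedule="discrete", betas=betas)
    # `model` is assumed to be a discrete-time noise-prediction network eps(x_t, t).
    model_fn = model_wrapper(model, noise_schedule, model_type="noise")
    solver = DPM_Solver(model_fn, noise_schedule)
    return solver.sample(x_T, steps=steps, order=2,
                         skip_type="time_uniform", method="multistep")
```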
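
The optimizer details in the Experiment Setup row translate into a short PyTorch configuration. The sketch below wires up AdamW with the quoted hyperparameters (lr 2e-4, weight decay 0.03, betas (0.9, 0.9)), a 5K-step warm-up, the reduce-by-10-on-plateau rule, and mixed-precision training; the warm-up shape and plateau patience are assumptions, since the quoted text does not specify them.

```python
import torch

def build_training_tools(model, warmup_steps=5_000):
    # Quoted hyperparameters: lr 2e-4, weight decay 0.03, betas (0.9, 0.9).
    opt = torch.optim.AdamW(model.parameters(), lr=2e-4,
                            weight_decay=0.03, betas=(0.9, 0.9))
    # 5K warm-up steps; a linear ramp is assumed.
    warmup = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: min(1.0, (step + 1) / warmup_steps))
    # "Reduce the learning rate by a factor of 10 ... whenever the validation
    # loss does not decrease"; the patience value is an illustrative choice.
    plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt, mode="min", factor=0.1, patience=1)
    scaler = torch.cuda.amp.GradScaler()  # "train with mixed precision"
    return opt, warmup, plateau, scaler

def train_step(loss_fn, batch, model, opt, warmup, scaler):
    # One mixed-precision optimization step; plateau.step(val_loss) would be
    # called separately after each validation pass.
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, *batch)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    warmup.step()
    return loss.detach()
```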