U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

Authors: Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models.'
Researcher Affiliation | Collaboration | 1 State Key Lab of General AI, School of Intelligence Science and Technology, Peking University. 2 Huawei Noah's Ark Lab.
Pseudocode | No | The paper describes methods in prose and includes architectural diagrams (Figure 3) but does not contain a formal pseudocode or algorithm block.
Open Source Code | Yes | 'Codes are available at https://github.com/YuchuanTian/U-DiT.'
Open Datasets | Yes | 'The training is conducted with the training set of ImageNet 2012 [12].'
Dataset Splits | No | The paper states 'The training is conducted with the training set of ImageNet 2012 [12]' and evaluates on ImageNet 256×256 and ImageNet 512×512, but does not explicitly provide percentages or sample counts for training, validation, and test splits.
Hardware Specification | Yes | 'We used 8 NVIDIA A100s (80G) to train U-DiT-B and U-DiT-L models.'
Software Dependencies | No | The paper mentions using sd-vae-ft-ema, the AdamW optimizer, MindSpore, CANN, and the Ascend AI Processor, but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | 'The same VAE (i.e. sd-vae-ft-ema) for latent diffusion models [29] and the AdamW optimizer is adopted. The training hyperparameters are kept unchanged, including global batch size 256, learning rate 1e-4, weight decay 0, and global seed 0.'