UltraPixel: Advancing Ultra High-Resolution Image Synthesis to New Peaks

Authors: Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, Lei Zhu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model achieves fast training with reduced data requirements, producing photo-realistic high-resolution images and demonstrating state-of-the-art performance in extensive experiments.
Researcher Affiliation | Collaboration | Jingjing Ren1, Wenbo Li2, Haoyu Chen1, Renjing Pei2, Bin Shao2, Yong Guo3, Long Peng2, Fenglong Song2, Lei Zhu1,4; 1HKUST (Guangzhou), 2Huawei Noah's Ark Lab, 3MPI, 4HKUST
Pseudocode | No | The paper includes diagrams and descriptions of its process but does not contain formal pseudocode or an algorithm block.
Open Source Code | Yes | Project page: https://jingjingrenabc.github.io/ultrapixel. The code repository link is provided on the project page: https://jingjingrenabc.github.io/ultrapixel/.
Open Datasets | Yes | We train models on 1M images of varying resolutions and aspect ratios, ranging from 1024 to 4608, sourced from LAION-Aesthetics [44], SAM [24], and a self-collected high-quality dataset.
Dataset Splits | No | The paper mentions training on 1M images and evaluating on 1,000 images but does not explicitly detail training, validation, and test dataset splits with percentages or sample counts.
Hardware Specification | Yes | The training is conducted on 8 A100 GPUs with a batch size of 64.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | The training is conducted on 8 A100 GPUs with a batch size of 64. We employ the AdamW optimizer [30] with a learning rate of 0.0001. During training, we use continuous timesteps in [0, 1] as in [36], while LR guidance is consistently corrupted with noise at timestep t = 0.05. During inference, the generative model uses 20 sampling steps, and the diffusion decoding model uses 10 steps. We adopt DDIM [45] with a classifier-free guidance [19] weight of 4 for latent generation and 1.1 for diffusion decoding.
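To make the reported setup concrete, below is a minimal PyTorch sketch of how these hyperparameters fit together. Everything beyond the quoted numbers is an assumption: the ToyDenoiser architecture, the cosine corruption schedule, the additive conditioning, and the noise-prediction loss are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Values quoted from the setup above; module names and the noise schedule
# are illustrative assumptions, not the authors' code.
LEARNING_RATE = 1e-4            # AdamW learning rate
BATCH = 64                      # batch size (training ran on 8 A100 GPUs)
LR_GUIDANCE_T = 0.05            # fixed corruption timestep for LR guidance
GEN_STEPS, DEC_STEPS = 20, 10   # DDIM steps: latent generation / diffusion decoding
CFG_GEN, CFG_DEC = 4.0, 1.1     # classifier-free guidance weights

class ToyDenoiser(nn.Module):
    """Stand-in for the generative model; the real architecture differs."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)

    def forward(self, x, t, cond=None):
        h = torch.cat([x, t.expand(x.shape[0], 1)], dim=-1)
        eps = self.net(h)
        return eps if cond is None else eps + cond  # crude additive conditioning

def corrupt(x, t, noise=None):
    """Variance-preserving corruption at a continuous timestep t in [0, 1].
    The exact schedule is not stated in the setup; a cosine ramp is assumed."""
    noise = torch.randn_like(x) if noise is None else noise
    return torch.cos(t * torch.pi / 2) * x + torch.sin(t * torch.pi / 2) * noise

def cfg_predict(model, x, t, cond, w):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    return model(x, t, None) + w * (model(x, t, cond) - model(x, t, None))

model = ToyDenoiser()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

# One illustrative training step: timesteps are drawn continuously from [0, 1],
# while the LR guidance is always corrupted at the fixed timestep t = 0.05.
x0 = torch.randn(BATCH, 16)                 # clean latents (placeholder data)
lr_guide = corrupt(torch.randn(BATCH, 16), torch.tensor(LR_GUIDANCE_T))
t = torch.rand(BATCH, 1)                    # continuous timesteps in [0, 1]
noise = torch.randn_like(x0)
loss = ((model(corrupt(x0, t, noise), t, lr_guide) - noise) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At inference time, cfg_predict(model, x_t, t, cond, CFG_GEN) would supply the guided prediction inside each of the 20 DDIM updates for latent generation, and CFG_DEC would play the same role inside each of the 10 diffusion-decoding steps.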