Exploring DCN-like architecture for fast image generation with arbitrary resolution

Authors: Shuai Wang, Zexian Li, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on 32x32 CIFAR10 and 256x256 ImageNet datasets. The training batch size is set to 256. Similar to SiT [23] and DiT [12], we use Adam optimizer [31] with a constant learning rate 0.0001 during the whole training. We do not adopt any gradient clip techniques for fair comparison. For 32x32 CIFAR10 dataset, we train our model for 25000 steps. As for 256x256 ImageNet dataset, we train for 1.5M steps. We use 8 A100 GPUs as the default training hardware. FlowDCN achieves the state-of-the-art 4.30 sFID on 256x256 ImageNet Benchmark and comparable resolution extrapolation results, surpassing transformer-based counterparts in terms of convergence speed (only 1/5 images), visual quality, parameters (8% reduction) and FLOPs (20% reduction).
Researcher Affiliation | Collaboration | Shuai Wang (Nanjing University); Zexian Li (Alibaba Group); Tianhui Song (Nanjing University); Xubin Li (Alibaba Group); Tiezheng Ge (Alibaba Group); Bo Zheng (Alibaba Group); Limin Wang (Nanjing University, Shanghai AI Lab)
Pseudocode | No | No explicit pseudocode or algorithm block found in the paper.
Open Source Code | No | We plan to open-source our code and implementation later.
Open Datasets | Yes | We conduct experiments on 32x32 CIFAR10 and 256x256 ImageNet datasets. The CIFAR10 dataset [35], comprising 50,000 32x32 small-resolution images from 10 distinct class categories, is considered an ideal benchmark to validate the design of our MultiScale deformable block due to its relatively small scale.
Dataset Splits | No | We conduct experiments on 32x32 CIFAR10 and 256x256 ImageNet datasets. The training batch size is set to 256. Similar to SiT [23] and DiT [12], we use Adam optimizer [31] with a constant learning rate 0.0001 during the whole training. We do not adopt any gradient clip techniques for fair comparison. For 32x32 CIFAR10 dataset, we train our model for 25000 steps. As for 256x256 ImageNet dataset, we train for 1.5M steps.
Hardware Specification | Yes | We use 8 A100 GPUs as the default training hardware. FP16/FP32 results are collected on Nvidia A10 GPU.
Software Dependencies | No | No specific version numbers for general software dependencies like Python, PyTorch, or other libraries are provided.
Experiment Setup | Yes | The training batch size is set to 256. Similar to SiT [23] and DiT [12], we use Adam optimizer [31] with a constant learning rate 0.0001 during the whole training. ... For 32x32 CIFAR10 dataset, we train our model for 25000 steps. As for 256x256 ImageNet dataset, we train for 1.5M steps. For sampling, we employ the Euler stochastic solver with 1000 sampling steps to generate images. To generate images, we employ an Euler-Maruyama solver with 250 steps for stochastic sampling. ... classifier-free guidance with 1.375.
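
The quoted setup reduces to a small set of concrete hyperparameters. The sketch below restates them in PyTorch and shows a generic Euler-Maruyama sampling loop with classifier-free guidance at the reported scale of 1.375. Since the code is not released, the `velocity_model` callable, the constant noise scale `sigma`, and the exact step and guidance bookkeeping are assumptions for illustration, not the authors' implementation; only the numeric settings (batch size 256, constant Adam learning rate 1e-4, no gradient clipping, 250 Euler-Maruyama steps, guidance 1.375) come from the table above.

```python
import torch

# Hyperparameters quoted in the table above (FlowDCN, 256x256 ImageNet setting).
BATCH_SIZE = 256            # training batch size
LEARNING_RATE = 1e-4        # constant Adam learning rate, no schedule
NUM_SAMPLING_STEPS = 250    # Euler-Maruyama steps (the CIFAR10 setting quotes 1000 Euler steps)
CFG_SCALE = 1.375           # classifier-free guidance scale


def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Constant-LR Adam as described in the paper; no gradient clipping is applied.
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


@torch.no_grad()
def euler_maruyama_sample(velocity_model, x, labels, null_labels, sigma=1.0):
    """Generic Euler-Maruyama loop with classifier-free guidance.

    `velocity_model(x, t, y)` is a placeholder callable returning the drift
    (velocity) field; `sigma` is an assumed constant noise scale, not the
    paper's diffusion schedule.
    """
    dt = 1.0 / NUM_SAMPLING_STEPS
    for i in range(NUM_SAMPLING_STEPS):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        # Classifier-free guidance: blend conditional and unconditional predictions.
        v_cond = velocity_model(x, t, labels)
        v_uncond = velocity_model(x, t, null_labels)
        v = v_uncond + CFG_SCALE * (v_cond - v_uncond)
        # Deterministic drift step plus a Gaussian noise term (omitted at the last step).
        noise = torch.randn_like(x) if i < NUM_SAMPLING_STEPS - 1 else torch.zeros_like(x)
        x = x + v * dt + sigma * (dt ** 0.5) * noise
    return x
```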