PTQ4DiT: Post-training Quantization for Diffusion Transformers
Authors: Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time. |
| Researcher Affiliation | Academia | ¹University of Illinois Chicago, ²Illinois Institute of Technology, ³University of Central Florida |
| Pseudocode | Yes | Algorithm 1 Post-Training Quantization for Diffusion Transformers (PTQ4DiT) |
| Open Source Code | Yes | https://github.com/adreamwu/PTQ4DiT |
| Open Datasets | Yes | We evaluate PTQ4DiT on the ImageNet dataset [41] |
| Dataset Splits | Yes | To construct the calibration set, we uniformly select 25 timesteps for 256-resolution experiments and 10 timesteps for 512-resolution experiments, generating 32 samples at each selected timestep. (See the calibration-schedule sketch after the table.) |
| Hardware Specification | Yes | For instance, generating a 512×512 resolution image using DiTs can take more than 20 seconds and 105 Gflops on an NVIDIA RTX A6000 GPU. ... Our code is based on PyTorch [36], and all experiments are conducted on NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper states 'Our code is based on PyTorch [36]', but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Our experimental setup is similar to the original study of Diffusion Transformers (DiTs) [37]. We evaluate PTQ4DiT on the ImageNet dataset [41], using pre-trained class-conditional DiT-XL/2 models [37] at image resolutions of 256×256 and 512×512. The DDPM solver [17] with 250 sampling steps is employed for the generation process. To further assess the robustness of our method, we conduct additional experiments with reduced sampling steps of 100 and 50. For fair benchmarking, all methods utilize uniform quantizers for all activations and weights, with channel-wise quantization for weights and tensor-wise for activations, unless specified otherwise. To construct the calibration set, we uniformly select 25 timesteps for 256-resolution experiments and 10 timesteps for 512-resolution experiments, generating 32 samples at each selected timestep. The optimization of quantization parameters follows the implementation from Q-Diffusion [18]. (See the quantizer sketch after the table.) |
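
The calibration schedule quoted in the Dataset Splits row is concrete enough to sketch. The snippet below is a minimal illustration, not the authors' released code: it assumes the 250-step DDPM schedule from the Experiment Setup row and reads "uniformly select" as evenly spaced timestep indices; the function name `calibration_timesteps` is our own.

```python
import numpy as np

def calibration_timesteps(num_sampling_steps: int, num_selected: int) -> np.ndarray:
    # Evenly spaced timestep indices across the full sampling schedule
    # (assumption: this is what "uniformly select" means in the paper).
    return np.linspace(0, num_sampling_steps - 1, num_selected, dtype=int)

# Values reported in the paper: 25 timesteps at 256 resolution,
# 10 timesteps at 512 resolution, 32 calibration samples per selected timestep.
timesteps_256 = calibration_timesteps(250, 25)
timesteps_512 = calibration_timesteps(250, 10)
samples_per_timestep = 32
```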
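
The quantizer configuration in the Experiment Setup row (uniform quantizers, channel-wise for weights, tensor-wise for activations) can likewise be sketched. This is a generic min-max asymmetric fake-quantizer written under our own assumptions; the paper's actual quantization parameters are optimized following Q-Diffusion [18], which this snippet does not reproduce.

```python
import torch

def uniform_quantize(x: torch.Tensor, n_bits: int, per_channel: bool = False) -> torch.Tensor:
    # Asymmetric uniform quantizer with a min-max range estimate (our assumption;
    # PTQ4DiT optimizes the quantization parameters following Q-Diffusion instead).
    if per_channel:
        # One (scale, zero-point) pair per output channel, as used for weights.
        flat = x.flatten(1)
        x_min = flat.min(dim=1).values.view(-1, *([1] * (x.dim() - 1)))
        x_max = flat.max(dim=1).values.view(-1, *([1] * (x.dim() - 1)))
    else:
        # A single (scale, zero-point) pair for the whole tensor, as used for activations.
        x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / (2 ** n_bits - 1)
    zero_point = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 2 ** n_bits - 1)
    return (q - zero_point) * scale  # fake-quantized (dequantized) tensor

# Illustrative shapes only, not the DiT layer dimensions.
w_q = uniform_quantize(torch.randn(128, 64), n_bits=8, per_channel=True)   # W8, channel-wise
a_q = uniform_quantize(torch.randn(4, 64), n_bits=8, per_channel=False)    # A8, tensor-wise
```

Under the W4A8 setting reported in the paper, the same sketch would use `n_bits=4` for weights and `n_bits=8` for activations.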