DiTFastAttn: Attention Compression for Diffusion Transformer Models

Authors: Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DiTFastAttn using multiple DiT models, including DiT-XL (Peebles & Xie, 2023) and PixArt-Sigma (Chen et al., 2024) for image generation, and Open-Sora (Open-Sora, 2024) for video generation. Our findings demonstrate that DiTFastAttn consistently reduces the computational cost. Notably, the higher the resolution, the greater the savings in computation and latency.
Researcher Affiliation | Collaboration | Zhihang Yuan (1,2), Hanling Zhang (1,2), Pu Lu (1), Xuefei Ning (1), Linfeng Zhang (3), Tianchen Zhao (1,2), Shengen Yan (2), Guohao Dai (3,2), Yu Wang (1); 1: Tsinghua University, 2: Infinigence AI, 3: Shanghai Jiao Tong University
Pseudocode | Yes | Algorithm 1: Method for Deciding the Compression Plan (a sketch of this selection procedure appears after the table).
Open Source Code | No | Project website: http://nics-effalg.com/DiTFastAttn. The paper refers to a project website but does not provide a direct link to a source-code repository (e.g., GitHub, GitLab, Bitbucket) for the methodology, nor does it explicitly state that code is provided in supplementary material.
Open Datasets | Yes | For calculating quality metrics, we use ImageNet as the evaluation dataset for DiT and MS-COCO as the evaluation dataset for PixArt-Sigma. MS-COCO 2014 captions are used as text prompts for PixArt-Sigma image generation.
Dataset Splits | No | The paper states that it uses ImageNet and MS-COCO for evaluation and generates 50k/30k images for quality metrics, but it does not specify explicit train/validation/test splits (percentages, sample counts, or citations to predefined splits).
Hardware Specification | Yes | We measure the latency per sample on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions software such as FlashAttention-2 (Dao, 2023), DPM-Solver, and IDDPM, but does not provide version numbers for these or any other key software components used in the experiments.
Experiment Setup | Yes | To demonstrate compatibility with fast sampling methods, we build our method upon the 50-step DPM-Solver for DiT and PixArt-Sigma, and the 200-step IDDPM (Nichol & Dhariwal, 2021) for Open-Sora. We use the mean relative absolute error for L(O, O') and experiment with different thresholds δ at intervals of 0.025. We denote these threshold settings as D1 (δ=0.025), D2 (δ=0.05), ..., D6 (δ=0.15), respectively. We set the window size of WA-RS to 1/8 of the token size. ... a small positive constant ϵ (set to 10⁻⁶ in our experiments) ... DiT runs with a batch size of 8, while PixArt-Sigma models run with a batch size of 1. (The error metric, threshold grid, and window sizing are sketched in code after the table.)
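The Pseudocode row only names Algorithm 1, but the setup row pins down the error criterion it relies on: the mean relative absolute error L(O, O') with constant ϵ and a threshold δ. Below is a minimal Python sketch of that kind of plan decision, assuming the plan is chosen by trying candidate attention variants from most to least compressed and keeping the first one whose error against full attention stays under δ; the function and variable names here are illustrative, not the authors' code.

    import torch

    def mean_relative_absolute_error(o_full, o_comp, eps=1e-6):
        # L(O, O'): element-wise |O - O'| / (|O| + eps), averaged over all
        # elements; eps = 1e-6 matches the paper's small positive constant.
        return ((o_full - o_comp).abs() / (o_full.abs() + eps)).mean().item()

    def decide_compression_plan(o_full, candidates, delta):
        # candidates: (technique_name, compressed_output) pairs, assumed
        # ordered from most to least aggressive compression. Return the
        # first technique whose output error stays under the threshold
        # delta; fall back to full attention if none qualifies.
        for name, o_comp in candidates:
            if mean_relative_absolute_error(o_full, o_comp) < delta:
                return name
        return "full_attention"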
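The remaining hyperparameters quoted in the setup row translate directly into configuration values. A short sketch, with illustrative names:

    # Threshold grid: delta steps in increments of 0.025,
    # i.e. D1 = 0.025, D2 = 0.05, ..., D6 = 0.15.
    thresholds = {f"D{k}": round(k * 0.025, 3) for k in range(1, 7)}

    # WA-RS window size is 1/8 of the token (sequence) length, e.g. a
    # 1024-token sequence gets a 128-token attention window.
    def wars_window_size(num_tokens):
        return num_tokens // 8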