Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Accelerating Parallel Diffusion Model Serving with Residual Compression

Authors: Jiajun Luo, Yicheng Xiao, Jianru Xu, Yangxiu You, Rongwei Lu, Chen Tang, Jingyan Jiang, Zhi Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 4 Experiments Models. Our method works with off-the-shelf models, we evaluate it on the state-of-the-art FLUX.1-dev [2] for image generation, and on CogVideoX-2b [29] for video generation. 4.2 Main Results CompactFusion achieves lower latency while maintaining high visual fidelity. It consistently performs well across different generation models, including FLUX.1-dev for images and CogVideoX for videos. It is also robust across hardware setups L20, H20, and A40 and across network conditions such as NVLink, PCIe, and simulated Ethernet. These results are shown in Figure 6. Quantitative metrics across 3, 4, and 6-GPU scales are provided in Tables 1 and 2. 4.3 Ablation Studies We perform targeted ablation studies to evaluate the contributions of core components and design choices in CompactFusion.
Researcher Affiliation Collaboration Jiajun Luo (Shenzhen International Graduate School, Tsinghua University); Yicheng Xiao (Southern University of Science and Technology); Jianru Xu (Southern University of Science and Technology); Yangxiu You (Jiangnan University); Rongwei Lu (Shenzhen International Graduate School, Tsinghua University); Chen Tang (The Chinese University of Hong Kong); Jingyan Jiang (Shenzhen Technology University); Zhi Wang (Shenzhen International Graduate School, Tsinghua University)
Pseudocode Yes
  Algorithm 1 Patch Parallel
  Input: Local tokens/features X_i on device i
    1: Q_i, K_i, V_i ← ProjectQKV(X_i)
    2: K_gathered, V_gathered ← AllGather(K_i, V_i)
    3: O_i ← Attention(Q_i, K_gathered, V_gathered)
    4: return O_i
  Algorithm 2 Ring Attention
  Input: Local tokens/features X_i on device i, total devices N
    1: Q_i, K_i, V_i ← ProjectQKV(X_i)
    2: Initialize local output buffer: O_i ← 0
    3: for s = 0 to N − 1 do
    4:   K^(s), V^(s) ← RingSendRecv(K_i, V_i, s)   (receive shard from device (i − s) mod N)
    5:   O_i ← Aggregate(O_i, AttentionPartial(Q_i, K^(s), V^(s)))
    6: end for
    7: return O_i
  Algorithm 3 Subspace Iteration (rank-r)
  Input: A ∈ R^(m×n), target rank r, iterations T
    1: Randomly sample Q and orthonormalize: Q ← orthogonalize(Q)
    2: for t = 1 to T do
    3:   Z ← A^T A Q ∈ R^(n×r)
    4:   Q ← orthogonalize(Z)
    5: end for
    6: U ← A Q ∈ R^(m×r)
    7: U ← orthogonalize(U)
    8: V ← Q ∈ R^(n×r)
    9: return U, V
Open Source Code Yes Portable implementation demonstrated on xDiT is publicly available at https://github.com/Cobalt-27/CompactFusion. Our code is publicly released, and we provide full experimental settings (Section 4 and Appendix E). We release our full codebase with detailed instructions, scripts for reproducing results. All used models and datasets are publicly available, and setup details are included in the paper.
Open Datasets Yes Dataset. We test the image generation model using prompts in COCO Captions 2014 dataset [30] and video generation using prompts sampled from VBench [31]. [30] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015. [31] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models, 2023.
Dataset Splits Yes Dataset. For evaluation, we randomly sample 5000 prompts from the image validation set and 200 video validation prompts. Warmup Step. Like displaced parallel, CompactFusion requires at least a 1-step warmup, where the uncompressed activation is used to initialize the base tensor for later residual computation (detailed in Appendix D). Figure 8 compares different methods and warmups. We observe that CompactFusion maintains stable and high visual quality with just a single warmup step, showing little degradation compared to longer warmup.
Hardware Specification Yes Setting: 4×L20, FLUX.1-dev, 28-step, 1024×1024 resolution, 1-step warmup for all algorithms. On 4×H20 (NVLink) and 4×L20 (PCIe) clusters (Figure 1). Hardware. To demonstrate broad applicability, experiments are conducted on various hardware and interconnects: high-bandwidth NVLink (H20 clusters, bandwidth: 366 GB/s), standard PCIe (L20 clusters, bandwidth: 17.13 GB/s) and simulated lower-bandwidth Ethernet (A40 clusters using tc for traffic control), ensuring robustness evaluation under various deployment constraints.
Software Dependencies No The paper mentions using frameworks like "xDiT" and "distrifuser" but does not provide specific version numbers for these or any other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup Yes Setting: 4×L20, FLUX.1-dev, 28-step, 1024×1024 resolution, 1-step warmup for all algorithms. Models. Our method works with off-the-shelf models, we evaluate it on the state-of-the-art FLUX.1-dev [2] for image generation, and on CogVideoX-2b [29] for video generation. We adhere to standard inference settings on xDiT, employing 28 steps for FLUX.1-dev and 50 steps for CogVideoX-2b. The default scheduler used is DPM solver [37].
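
Illustration (not from the paper's codebase): the rank-r subspace iteration quoted as Algorithm 3 in the Pseudocode row can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the quoted pseudocode does not say how orthogonalize is implemented, so modified Gram-Schmidt is assumed here; the initial Q is assumed to be Gaussian; and the helper names matmul and transpose are hypothetical. A practical implementation would likely use a tensor library's QR routine instead.

```python
import random


def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def orthogonalize(M):
    """Orthonormalize the columns of M with modified Gram-Schmidt.

    Assumption: the columns of M are linearly independent; a zero
    column after projection would cause a division by zero.
    """
    basis = []  # orthonormal columns collected so far
    for col in transpose(M):
        v = list(col)
        for q in basis:
            dot = sum(x * y for x, y in zip(q, v))
            v = [x - dot * y for x, y in zip(v, q)]
        norm = sum(x * x for x in v) ** 0.5
        basis.append([x / norm for x in v])
    return transpose(basis)


def subspace_iteration(A, r, T, seed=0):
    """Follow the steps of Algorithm 3 for an m x n matrix A.

    Returns U (m x r, orthonormal columns) and V (n x r, orthonormal columns).
    """
    rng = random.Random(seed)
    n = len(A[0])
    At = transpose(A)
    # Step 1: random Q (Gaussian entries are an assumption), then orthonormalize.
    Q = orthogonalize([[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)])
    # Steps 2-5: repeat Z = A^T A Q followed by re-orthonormalization.
    for _ in range(T):
        Z = matmul(At, matmul(A, Q))
        Q = orthogonalize(Z)
    # Steps 6-8: U = orthogonalize(A Q), V = Q.
    U = orthogonalize(matmul(A, Q))
    V = Q
    return U, V
```

One way to check the result, again as an assumption rather than the paper's reconstruction rule, is to project A onto the span of U with matmul(U, matmul(transpose(U), A)); for a matrix whose rank is exactly r this projection should reproduce A up to floating-point error.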