Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Authors: Ruichen Chen, Keith Mills, Liyao Jiang, Chao Gao, Di Niu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.
Researcher Affiliation | Collaboration | Ruichen Chen, ECE Department, University of Alberta; Keith G. Mills, Division of CSE, Louisiana State University; Liyao Jiang, ECE Department, University of Alberta; Chao Gao, Huawei Technologies, Edmonton, Alberta, Canada; Di Niu, ECE Department, University of Alberta
Pseudocode | No | The paper describes the proposed method, Re-ttention, through textual explanation and mathematical equations (e.g., Equations 8, 9, and 10), but does not include a dedicated pseudocode block or algorithm figure.
Open Source Code | Yes | Justification: Code is included as supplementary material.
Open Datasets | Yes | We evaluate Re-ttention on both the text-to-video (T2V) and text-to-image (T2I) tasks using a number of DiT models, such as CogVideoX (2B) [47], PixArt-α/Σ (0.6B) [3, 2] and Hunyuan-DiT (1.6B) [23]. ... We perform quantitative T2V evaluation using the Animal and Architecture categories of VBench [16]... We evaluate T2I performance on a comprehensive set of benchmark metrics: GenEval [12], HPSv2 [41], and MS-COCO 2014 [25].
Dataset Splits | Yes | We perform quantitative T2V evaluation using the Animal and Architecture categories of VBench [16], which consist of 100 videos each. ... HPSv2 consists of four image categories: Animation, Concept-art, Painting and Photos. Each category consists of 800 images for 3.2k generations in total. Finally, we generate 10k images using the MS-COCO 2014 validation set and measure the LPIPS score [52], ImageReward (IR) [45] and CLIP score [13] using the ViT-B/16 backbone.
Hardware Specification | No | Although we did not implement a custom GPU kernel, we measured inference latency on typical GPUs and observed that Re-ttention achieves comparable runtime to DiTFastAttn across all tested models. This demonstrates that our contributions do not impose significant computational overhead, confirming that Re-ttention maintains both high sparsity and practical efficiency.
Software Dependencies | No | Specifically, we use the Hugging Face Diffusers library [39] to instantiate the base DiT models and consider the default values for inference parameters like the classifier-free guidance (CFG) scale and number of denoising steps: 50 for CogVideoX/Hunyuan and 20 for the PixArt DiTs.
Experiment Setup | Yes | Specifically, we use the Hugging Face Diffusers library [39] to instantiate the base DiT models and consider the default values for inference parameters like the classifier-free guidance (CFG) scale and number of denoising steps: 50 for CogVideoX/Hunyuan and 20 for the PixArt DiTs. Following prior literature on DiT acceleration [42, 53, 22, 27], we apply the full attention during the first 5, 10 or 15 steps for the PixArt DiTs, Hunyuan and CogVideoX models, respectively, and then apply the sparse attention mechanism for the remainder of the denoising process. Further, we set a caching period of 5 steps for DiTFastAttn and Re-ttention, where we perform full attention to cache required statistics.
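
The step schedule quoted in the Experiment Setup row can be illustrated with a short sketch. The step counts, warm-up lengths, and caching period are taken from the quoted text; everything else is an assumption made for illustration, not the authors' implementation. In particular, the names `SETUPS` and `attention_schedule` are hypothetical, and the sketch assumes that "caching period of 5 steps" means one full-attention refresh every fifth step after the warm-up, which the quoted text does not spell out.

```python
# Step counts and warm-up lengths as quoted in the Experiment Setup row.
# The dictionary keys are illustrative labels, not identifiers from the paper.
SETUPS = {
    "pixart": {"num_steps": 20, "warmup_steps": 5},
    "hunyuan": {"num_steps": 50, "warmup_steps": 10},
    "cogvideox": {"num_steps": 50, "warmup_steps": 15},
}
CACHE_PERIOD = 5  # caching period quoted for DiTFastAttn and Re-ttention


def attention_schedule(num_steps, warmup_steps, cache_period=CACHE_PERIOD):
    """Return a list with one entry per denoising step: "full" or "sparse".

    ASSUMPTION: after the warm-up, every `cache_period`-th step runs full
    attention to refresh the cached statistics; all other steps are sparse.
    """
    schedule = []
    for step in range(num_steps):
        if step < warmup_steps:
            # Warm-up phase: full attention on every step.
            schedule.append("full")
            continue
        # 1 on the first post-warm-up step, 2 on the second, and so on.
        steps_since_warmup = step - warmup_steps + 1
        if steps_since_warmup % cache_period == 0:
            schedule.append("full")  # assumed cache-refresh step
        else:
            schedule.append("sparse")
    return schedule


if __name__ == "__main__":
    for name, cfg in SETUPS.items():
        sched = attention_schedule(**cfg)
        print(name, "full:", sched.count("full"), "sparse:", sched.count("sparse"))
```

Under a different reading of the caching period (for example, refreshing on the first step after warm-up), only the modulo condition would change; the warm-up handling stays the same.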