Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Accelerating Diffusion Transformers with Token-wise Feature Caching

Authors: Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on PixArt-α, OpenSora, DiT and FLUX demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36× and 1.93× acceleration are achieved on OpenSora and PixArt-α with almost no drop in generation quality. ... Abundant experiments on PixArt-α, OpenSora, and DiT have been conducted, which demonstrates that ToCa achieves a high acceleration ratio while maintaining nearly lossless generation quality.
Researcher Affiliation | Academia | Chang Zou (1, 2), Xuyang Liu (3), Ting Liu (4), Siteng Huang (5), Linfeng Zhang (1). (1) Shanghai Jiao Tong University; (2) University of Electronic Science & Technology of China; (3) Sichuan University; (4) National University of Defense Technology; (5) Zhejiang University
Pseudocode | Yes | Algorithm 1: ToCa
    Input: current timestep t, current layer id l.
    if current timestep t is a fresh step then
        Fully compute F_l(x).
        C_l(x) := F_l(x)  # Update the cache.
    else
        S(x_i) = Σ_{j=1}^{4} λ_j s_j  # Compute the cache score for each token.
        I_Compute := TopK(S(x_i), R%)  # Fetch the index of computed tokens.
        for all tokens x_i do
            if i ∈ I_Compute then
                Compute F_l(x_i) through the neural layer.
                C_l(x_i) := F_l(x_i)  # Update the cache.
            end if
        end for
    end if
    return F_l(x)  # Return features for both cached and computed tokens for the next layer.
Open Source Code | Yes | Code: https://github.com/Shenyi-Z/ToCa ... Our codes have been released for further exploration in this domain.
Open Datasets | Yes | For text-to-image generation, we utilize 30,000 captions randomly selected from COCO-2017 (Lin et al., 2014) to generate an equivalent number of images. ... For class-conditional image generation, we uniformly sample from 1,000 classes in ImageNet (Deng et al., 2009) to produce 50,000 images at a resolution of 256 × 256, evaluating performance using FID-50k (Heusel et al., 2017). Additionally, we employ sFID, Precision, and Recall as supplementary metrics. ... We leverage the VBench framework (Huang et al., 2024), generating 5 videos for each of the 950 benchmark prompts under different random seeds, resulting in a total of 4,750 videos.
Dataset Splits | No | The paper uses subsets or generated data for evaluation metrics (e.g., 30,000 captions from COCO-2017 to generate images, 50,000 images sampled from ImageNet classes for evaluation, 5 videos for each of 950 VBench prompts). These describe the evaluation setup rather than explicit training/validation/test splits of a specific dataset for model training or general use.
Hardware Specification | Yes | We conduct experiments on three commonly-used DiT-based models across different generation tasks, including PixArt-α (Chen et al., 2024a) for text-to-image generation, OpenSora (Zheng et al., 2024) for text-to-video generation, and DiT-XL/2 (Peebles & Xie, 2023) for class-conditional image generation with NVIDIA A800 80GB GPUs. ... All of our experiments were conducted on 6 A800 GPUs, each with 80GB of memory, running CUDA version 12.1. ... The CPUs used across all experiments were 84 vCPUs from an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz.
Software Dependencies | Yes | All of our experiments were conducted on 6 A800 GPUs, each with 80GB of memory, running CUDA version 12.1. The DiT model was executed in Python 3.12 with PyTorch version 2.4.0, while PixArt-α and OpenSora were run in Python 3.9. The PyTorch version for PixArt-α was 2.4.0, and for OpenSora it was 2.2.2.
Experiment Setup | Yes | For each model, we configure different average forced activation cycles N and average caching ratios R for ToCa as follows: PixArt-α: N = 3 and R = 70%, OpenSora: N = 3 for temporal attention, spatial attention, MLP, and N = 6 for cross-attention, with R = 85% exclusively for MLP, and DiT: N = 4 and R = 93%. ... Each model utilizes its default sampling method: DPM-Solver++ (Lu et al., 2022b) with 20 steps for PixArt-α, rflow (Liu et al., 2023) with 30 steps for OpenSora and DDPM (Ho et al., 2020) with 250 steps for DiT-XL/2. ... For PixArt-α: We set the average forced activation cycle of ToCa to N = 2, supplemented with a dynamic adjustment parameter w_t = 0.1. The parameter λ_t = 0.4 adjusts R at different time steps, and the average caching ratio is R = 70%. The parameter r_l = 0.3 adjusts R at different depth layers. The module preference weight r_type = 1.0 shifts part of the computation from cross-attention layers to MLP layers.
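
Illustrative sketch of Algorithm 1 (not the authors' code): the control flow quoted in the Pseudocode row can be approximated in a few lines of NumPy. Everything named here is hypothetical: `cache_scores`, `toca_layer`, `compute_fraction`, and the choice to pass the four sub-scores s_j as a ready-made array, since this page does not define them. The sketch also assumes that `layer_fn` acts on each token independently (as an MLP does) and that the top-scoring tokens are the ones recomputed; how the paper's caching ratio R maps onto the fraction of tokens recomputed is not settled by the excerpt, so it is exposed as a plain parameter. The released repository linked above remains the reference implementation.

```python
import numpy as np


def cache_scores(sub_scores, lambdas):
    """Weighted sum S(x_i) = sum_j lambda_j * s_j(x_i).

    sub_scores: array of shape (4, num_tokens), one row per sub-score s_j
                (assumed to be precomputed; their definitions are not given here).
    lambdas:    four weights lambda_j.
    """
    return np.asarray(lambdas, dtype=float) @ np.asarray(sub_scores, dtype=float)


def toca_layer(x, layer_fn, cache, is_fresh_step, scores=None, compute_fraction=0.3):
    """One layer call with token-wise caching.

    x:        array of shape (num_tokens, dim).
    layer_fn: token-wise layer F_l (assumption: each row is processed independently).
    cache:    features C_l(x) from an earlier step, or None.
    Returns (features for all tokens, updated cache).
    """
    if is_fresh_step or cache is None:
        out = layer_fn(x)            # fully compute F_l(x)
        return out, out.copy()       # C_l(x) := F_l(x)

    num_tokens = x.shape[0]
    k = max(1, int(round(compute_fraction * num_tokens)))
    compute_idx = np.argsort(scores)[-k:]         # indices of the k highest scores

    out = cache.copy()                            # cached tokens reuse old features
    out[compute_idx] = layer_fn(x[compute_idx])   # recompute only the selected tokens
    return out, out.copy()                        # cache updated at the computed tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 4))
    mlp = lambda t: np.tanh(t)                    # stand-in token-wise layer

    feats, cache = toca_layer(tokens, mlp, None, is_fresh_step=True)
    s = cache_scores(rng.random(size=(4, 8)), [0.25, 0.25, 0.25, 0.25])
    feats, cache = toca_layer(tokens + 0.1, mlp, cache, False, scores=s)
    print(feats.shape)
```

Returning the cache explicitly keeps the function free of hidden state, which makes the fresh-step and cached-step branches easy to check in isolation.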