Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Authors: Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, as well as prior cache-based methods, at the same inference speed. We explore our method on two commonly used transformer architectures: DiT [46] and U-ViT [3]. Specifically, we use DiT-XL/2 (256×256), DiT-XL/2 (512×512), DiT-L/2, and U-ViT-H/2. We measure image quality with Fréchet Inception Distance (FID) [43], sFID [43], Inception Score [51], Precision, and Recall [26]. In addition, we report the total MACs and the latency to compare acceleration ratios.
Researcher Affiliation | Collaboration | National University of Singapore; Huawei Technologies Ltd.
Pseudocode | Yes | Algorithm 1 (Training) and Algorithm 2 (Sampling)
Open Source Code | Yes | The code is available at https://github.com/horseee/learning-to-cache
Open Datasets | Yes | We take the training set of ImageNet to train β for 1 epoch.
Dataset Splits | No | No explicit training/validation/test dataset splits with percentages, sample counts, or citations to predefined splits were found. The paper mentions using the 'training set of ImageNet' but does not specify how it was partitioned for train/validation purposes within the authors' workflow.
Hardware Specification | Yes | Training is conducted on 8 A5000 GPUs with a global batch size of 64. Latency is measured when generating a batch of 8 images with classifier-free guidance on a single A5000; five runs are conducted and the average is reported.
Software Dependencies | No | The paper mentions 'pytorch-OpCounter' but does not specify version numbers for other key software components like Python, PyTorch, or CUDA, which are typically required for reproducibility.
Experiment Setup | Yes | The learning rate is set to 0.01 and the AdamW optimizer is used to optimize β. Training is conducted on 8 A5000 GPUs with a global batch size of 64. We take the training set of ImageNet to train β for 1 epoch. The guidance strength is set to 0.4. A threshold θ is used to discretize each β_ij to either 0 or 1. Table 8 lists λ and θ for training the router.
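
To make the quoted setup concrete, the following is a minimal, hypothetical PyTorch sketch of optimizing a per-(timestep, layer) router β with AdamW at learning rate 0.01 and then discretizing it with a threshold θ, as described in the Experiment Setup row. The tensor shape, the placeholder loss, and the value of θ are illustrative assumptions, not the authors' implementation; the official repository contains the actual training and sampling code.

```python
# Illustrative sketch only (hypothetical names and values; not the paper's code).
import torch

NUM_TIMESTEPS, NUM_LAYERS = 50, 28   # assumed: 50 sampling steps, DiT-XL/2 depth
LR, THETA = 0.01, 0.1                # lr from the paper; THETA is an illustrative threshold

# Learnable router beta_ij: one scalar per (timestep, layer) pair.
beta = torch.nn.Parameter(torch.ones(NUM_TIMESTEPS, NUM_LAYERS))

# Only beta is optimized; the diffusion transformer weights stay frozen.
optimizer = torch.optim.AdamW([beta], lr=LR)

for step in range(10):
    optimizer.zero_grad()
    # Placeholder objective: the paper's actual loss compares cached vs. full
    # forward passes of the frozen DiT/U-ViT (not reproduced here).
    loss = beta.abs().mean()
    loss.backward()
    optimizer.step()

# After training, discretize beta into a binary per-layer, per-timestep schedule
# with threshold THETA (whether 0 or 1 means "cache" follows the paper's convention).
cache_schedule = (beta.detach() > THETA).long()
print(cache_schedule.shape)  # torch.Size([50, 28])
```

At sampling time, such a binary schedule would be consulted at each timestep to decide which transformer layers recompute their outputs and which reuse the outputs cached from the previous step; the exact mechanism is given in the paper's Algorithm 2.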