Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
Authors: Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, as well as prior cache-based methods, at the same inference speed. We explore our methods on two commonly used transformer architectures: DiT [46] and U-ViT [3]. Specifically, we use DiT-XL/2 (256×256), DiT-XL/2 (512×512), DiT-L/2, and U-ViT-H/2. We measure image quality with Fréchet Inception Distance (FID) [43], sFID [43], Inception Score [51], Precision, and Recall [26]. In addition, we report the total MACs and the latency to compare acceleration ratios. (A metric-evaluation sketch follows the table.) |
| Researcher Affiliation | Collaboration | National University of Singapore¹, Huawei Technologies Ltd.² |
| Pseudocode | Yes | Algorithm 1 (Training) and Algorithm 2 (Sampling); a hedged sketch of both appears after the table. |
| Open Source Code | Yes | The code is available at https://github.com/horseee/learning-to-cache |
| Open Datasets | Yes | We take the training set of ImageNet to train β for 1 epoch. |
| Dataset Splits | No | No explicit training/validation/test dataset splits with percentages, sample counts, or citations to predefined splits were found. The paper mentions using the 'training set of ImageNet' but does not specify how it was partitioned for train/validation purposes within the authors' workflow. |
| Hardware Specification | Yes | The training is conducted on 8 A5000 GPUs with a global batch size of 64. The latency is tested when generating a batch of 8 images with classifier-free guidance on a single A5000; we conducted five tests and took the average. (A latency-measurement sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'pytorch-OpCounter' but does not specify version numbers for other key software components such as Python, PyTorch, or CUDA, which are typically required for reproducibility. |
| Experiment Setup | Yes | The learning rate is set to 0.01 and the AdamW optimizer is used to optimize β. The training is conducted on 8 A5000 GPUs with a global batch size of 64. We take the training set of ImageNet to train β for 1 epoch. Guidance strength is set to 0.4. A threshold θ is set to discretize β_ij to either 0 or 1. Table 8 lists λ and θ for training the router. (See the router sketch after the table.) |
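
The metrics listed under Research Type (FID, sFID, Inception Score, Precision, Recall, plus MACs and latency) can be checked with off-the-shelf tooling. The paper does not name its evaluation toolkit, so the sketch below assumes the torch-fidelity package for FID/IS/precision-recall and thop (pytorch-OpCounter, which the paper cites for MAC counting) purely for illustration; the placeholder model, input shape, and directory paths are hypothetical, and sFID would need a separate suite such as the ADM evaluation code.

```python
# Illustrative metric computation; toolkit choice, model, and paths are assumptions, not from the paper.
import torch
import torch.nn as nn
from thop import profile                      # pytorch-OpCounter, for MACs
from torch_fidelity import calculate_metrics  # FID / Inception Score / precision-recall

# MACs of a single forward pass. The Conv2d stands in for the diffusion transformer;
# swap in the real DiT/U-ViT model and its latent input shape.
model = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1))
dummy = torch.randn(1, 4, 32, 32)
macs, params = profile(model, inputs=(dummy,))

# Image-quality metrics between folders of generated and reference images.
metrics = calculate_metrics(
    input1="samples/generated",   # hypothetical directory of generated images
    input2="samples/reference",   # hypothetical directory of reference images
    cuda=True,
    fid=True,                     # Fréchet Inception Distance
    isc=True,                     # Inception Score
    prc=True,                     # precision / recall
)
print(macs, metrics)
```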
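
Algorithm 1 (Training) and Algorithm 2 (Sampling), together with the Experiment Setup row, describe learning a router β with AdamW (lr 0.01, 1 epoch on the ImageNet training set) and then discretizing β_ij with a threshold θ at sampling time so that low-scoring layers are skipped and served from cache. Below is a minimal sketch of that recipe under the stated hyperparameters; the names LayerCacheRouter, run_layers, and train_router are illustrative, the soft gating and loss are simplifications, and this is not the released implementation.

```python
# Hypothetical sketch of Learning-to-Cache-style router training and thresholded sampling.
import torch
import torch.nn as nn


class LayerCacheRouter(nn.Module):
    """One learnable score beta[step, layer]; high means recompute, low means reuse the cache."""

    def __init__(self, num_steps: int, num_layers: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(num_steps, num_layers))


def run_layers(blocks, x, step, router, cache, theta=None):
    """Run all transformer blocks for one denoising step.

    Training (theta=None): soft, differentiable blend of fresh and cached activations.
    Sampling (theta given): hard skip of layers scored below the threshold -- the speedup.
    """
    for j, block in enumerate(blocks):
        score = router.beta[step, j]
        if theta is not None and j in cache and score.item() <= theta:
            x = cache[j]                       # reuse the cached output, skip the layer
            continue
        fresh = block(x)
        if theta is None and j in cache:
            gate = torch.sigmoid(score)        # soft gate so beta receives gradients
            x = gate * fresh + (1 - gate) * cache[j]
        else:
            x = fresh
        cache[j] = x.detach()
    return x


def train_router(blocks, router, loader, num_steps, lam=1e-2):
    """One epoch: match the fully computed output while penalizing recomputation (lam ~ Table 8)."""
    opt = torch.optim.AdamW([router.beta], lr=0.01)   # lr from the Experiment Setup row
    for x, _ in loader:                                # e.g. an ImageNet training loader
        cache = {}
        for step in range(num_steps):
            # In the real method x would be the noisy latent for this denoising step.
            with torch.no_grad():
                full = x
                for block in blocks:                   # reference pass without any caching
                    full = block(full)
            gated = run_layers(blocks, x, step, router, cache, theta=None)
            loss = (gated - full).pow(2).mean() + lam * torch.sigmoid(router.beta[step]).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return router
```

At sampling time, the DDIM or DPM-Solver loop would call run_layers with the learned router and the θ from Table 8, so only layers whose scores exceed the threshold are recomputed.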
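
The Hardware Specification row reports latency for generating a batch of 8 images with classifier-free guidance on a single A5000, averaged over five runs. A small timing harness like the following matches that protocol; generate_batch is a stand-in for the actual sampling call and is not an API from the paper's repository.

```python
# Hypothetical latency harness matching the reported protocol: 5 timed runs, averaged.
import time
import torch


def measure_latency(generate_batch, num_runs: int = 5, warmup: int = 1) -> float:
    """Time a no-argument sampling callable and return the mean wall-clock seconds."""
    for _ in range(warmup):                    # warm-up run to exclude CUDA init costs
        generate_batch()
    timings = []
    for _ in range(num_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()           # ensure prior GPU work has finished
        start = time.perf_counter()
        generate_batch()                       # e.g. sample 8 images with classifier-free guidance
        if torch.cuda.is_available():
            torch.cuda.synchronize()           # wait for generation to actually complete
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```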