Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Authors: Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, as well as prior cache-based methods, at the same inference speed. We explore our method on two commonly used transformer architectures: DiT [46] and U-ViT [3]. Specifically, we use DiT-XL/2 (256×256), DiT-XL/2 (512×512), DiT-L/2, and U-ViT-H/2. We measure image quality with Fréchet Inception Distance (FID) [43], sFID [43], Inception Score [51], Precision, and Recall [26]. In addition, we report the total MACs and the latency to compare acceleration ratios.
Researcher Affiliation | Collaboration | National University of Singapore; Huawei Technologies Ltd.
Pseudocode | Yes | Algorithm 1 (Training) and Algorithm 2 (Sampling)
Open Source Code | Yes | The code is available at https://github.com/horseee/learning-to-cache
Open Datasets | Yes | We take the training set of ImageNet to train β for 1 epoch.
Dataset Splits | No | No explicit training/validation/test dataset splits with percentages, sample counts, or citations to predefined splits were found. The paper mentions using the 'training set of ImageNet' but does not specify how it was partitioned for train/validation purposes within the authors' workflow.
Hardware Specification | Yes | Training is conducted on 8 A5000 GPUs with a global batch size of 64. Latency is measured when generating a batch of 8 images with classifier-free guidance on a single A5000; five runs are conducted and the average is reported.
Software Dependencies | No | The paper mentions 'pytorch-OpCounter' but does not specify version numbers for other key software components like Python, PyTorch, or CUDA, which are typically required for reproducibility.
Experiment Setup | Yes | The learning rate is set to 0.01 and the AdamW optimizer is used to optimize β. Training is conducted on 8 A5000 GPUs with a global batch size of 64. We take the training set of ImageNet to train β for 1 epoch. The guidance strength is set to 0.4. A threshold θ is used to discretize each β_ij to either 0 or 1. Table 8 lists λ and θ for training the router.
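
To make the quoted setup concrete, the following is a minimal, hypothetical PyTorch sketch of optimizing a per-(timestep, layer) router β with AdamW at learning rate 0.01 and then discretizing it with a threshold θ, as described in the Experiment Setup row. The tensor shape, the placeholder loss, and the value of θ are illustrative assumptions, not the authors' implementation; the official repository contains the actual training and sampling code.

```python
# Illustrative sketch only (hypothetical names and values; not the paper's code).
import torch

NUM_TIMESTEPS, NUM_LAYERS = 50, 28   # assumed: 50 sampling steps, DiT-XL/2 depth
LR, THETA = 0.01, 0.1                # lr from the paper; THETA is an illustrative threshold

# Learnable router beta_ij: one scalar per (timestep, layer) pair.
beta = torch.nn.Parameter(torch.ones(NUM_TIMESTEPS, NUM_LAYERS))

# Only beta is optimized; the diffusion transformer weights stay frozen.
optimizer = torch.optim.AdamW([beta], lr=LR)

for step in range(10):
    optimizer.zero_grad()
    # Placeholder objective: the paper's actual loss compares cached vs. full
    # forward passes of the frozen DiT/U-ViT (not reproduced here).
    loss = beta.abs().mean()
    loss.backward()
    optimizer.step()

# After training, discretize beta into a binary per-layer, per-timestep schedule
# with threshold THETA (whether 0 or 1 means "cache" follows the paper's convention).
cache_schedule = (beta.detach() > THETA).long()
print(cache_schedule.shape)  # torch.Size([50, 28])
```

At sampling time, such a binary schedule would be consulted at each timestep to decide which transformer layers recompute their outputs and which reuse the outputs cached from the previous step; the exact mechanism is given in the paper's Algorithm 2.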