Deep Fusion: Efficient Network Training via Pre-trained Initializations
Authors: Hanna Mazzawi, Xavi Gonzalvo, Michael Wunder, Sammy Jerome, Benoit Dherin
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements, maintaining or surpassing the performance of traditional training methods in various NLP tasks and T5 model sizes. |
| Researcher Affiliation | Industry | Hanna Mazzawi¹, Xavi Gonzalvo¹, Michael Wunder¹, Sammy Jerome¹, Benoit Dherin²; ¹Google Research, New York, NY, USA; ²Google, Sunnyvale, CA, USA. |
| Pseudocode | No | The paper defines the FUSION operator using mathematical equations (Eq. 5) but does not provide structured pseudocode or algorithm blocks; an illustrative sketch follows this table. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | Yes | We begin by training T5 language models on the C4 dataset. ... We fine-tuned high performing settings from the first experiment together with a baseline on NLP tasks using the GLUE benchmark. (A loading sketch for both datasets follows this table.) |
| Dataset Splits | No | The paper mentions 'validation data' and 'evaluation accuracy' but does not provide specific details on the train/validation/test splits (percentages, counts, or explicit standard split names). |
| Hardware Specification | Yes | Table 1. Performance of different T5-Medium fusion methods at 1 million steps, replicated three times for standard deviation. Cost is in TPU V3 4x4 hours. |
| Software Dependencies | No | The paper mentions models like T5 and the use of TPUs, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We trained the following 4 experiments (see dimensionalities in Table 8 in Appendix B)... Every model (T5-S, T5-M, T5-L) is trained 1M steps. ... our experiments will show how the post-fusion learning rate affects the performance of the learning, as well as the parameters. To understand how the learning rate affects performance, we ran the normal T5 learning rate schedule with various offsets. (A schedule sketch follows this table.) |
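Since the paper gives the FUSION operator only as equations (Eq. 5) with no pseudocode, the sketch below shows one common way to fuse two pre-trained dense layers into a larger initialization by placing them on the block diagonal. The function name `fuse` and the pure block-diagonal form are illustrative assumptions, not a transcription of the paper's operator.

```python
import numpy as np

def fuse(w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Hypothetical fusion of two dense-layer weight matrices.

    Places the two pre-trained matrices on the block diagonal of a
    larger matrix, so each original sub-network initially computes
    its own function inside the fused model. This is a guess at the
    spirit of the paper's FUSION operator (Eq. 5), not Eq. 5 itself.
    """
    rows = w1.shape[0] + w2.shape[0]
    cols = w1.shape[1] + w2.shape[1]
    fused = np.zeros((rows, cols), dtype=w1.dtype)
    fused[:w1.shape[0], :w1.shape[1]] = w1   # top-left block
    fused[w1.shape[0]:, w1.shape[1]:] = w2   # bottom-right block
    return fused

# Example: fusing two 4x4 layers yields an 8x8 initialization.
a, b = np.random.randn(4, 4), np.random.randn(4, 4)
print(fuse(a, b).shape)  # (8, 8)
```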
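Both datasets cited in the Open Datasets row are publicly available. Below is a minimal loading sketch using TensorFlow Datasets; the specific C4 variant (`c4/en`), the GLUE task chosen (CoLA), and the splits are assumptions for illustration, since the paper does not state them.

```python
import tensorflow_datasets as tfds

# C4 pre-training corpus. The 'en' config is assumed; C4 is large,
# so reading the prepared copy from GCS (try_gcs=True) avoids a
# local generation step.
c4_train = tfds.load("c4/en", split="train", shuffle_files=True,
                     try_gcs=True)

# A GLUE task for fine-tuning. CoLA is only an example; the paper
# reports GLUE results without listing per-task configurations.
glue_cola = tfds.load("glue/cola", split="train")

for example in c4_train.take(1):
    print(example["text"])
```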
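The offset experiments quoted in the Experiment Setup row shift where on the learning rate curve post-fusion training resumes. A minimal sketch follows, assuming the standard T5 inverse-square-root schedule with a 10k-step warm-up; the exact offset values swept are in the paper and are not reproduced here.

```python
import math

def t5_inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Standard T5 schedule: lr = 1 / sqrt(max(step, warmup_steps))."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

def offset_lr(step: int, offset: int, warmup_steps: int = 10_000) -> float:
    """Offset variant: resume the schedule as if training had already
    advanced by `offset` steps when the fused model starts."""
    return t5_inverse_sqrt_lr(step + offset, warmup_steps)

# E.g., starting the fused model at the learning rate the schedule
# would have reached 100k steps into a normal run:
print(offset_lr(step=0, offset=100_000))  # ~0.00316
```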