Deep Fusion: Efficient Network Training via Pre-trained Initializations

Authors: Hanna Mazzawi, Javier Gonzalvo, Michael Wunder, Sammy Jerome, Benoit Dherin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements, maintaining or surpassing traditional training methods' performance in various NLP tasks and T5 model sizes.
Researcher Affiliation | Industry | Hanna Mazzawi (1), Xavi Gonzalvo (1), Michael Wunder (1), Sammy Jerome (1), Benoit Dherin (2); (1) Google Research, New York, NY, USA; (2) Google, Sunnyvale, CA, USA.
Pseudocode | No | The paper defines the FUSION operator using mathematical equations (Eq. 5) but does not provide structured pseudocode or algorithm blocks. (An illustrative sketch of such an operator follows this table.)
Open Source Code | No | The paper does not provide any concrete access to source code for the described methodology.
Open Datasets | Yes | We begin by training T5 language models on the C4 dataset. ... We fine-tuned high performing settings from the first experiment together with a baseline on NLP tasks using the GLUE benchmark. (A data-loading sketch for these public datasets follows this table.)
Dataset Splits | No | The paper mentions 'validation data' and 'evaluation accuracy' but does not provide specific details on the train/validation/test splits (percentages, counts, or explicit standard split names).
Hardware Specification | Yes | Table 1. Performance of different T5-Medium fusion methods at 1 million steps, replicated three times for standard deviation. Cost is in TPU v3 4x4 hours.
Software Dependencies | No | The paper mentions models such as T5 and the use of TPUs, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We trained the following 4 experiments (see dimensionalities in Table 8 in Appendix B)... Every model (T5-S, T5-M, T5-L) is trained 1M steps. ... our experiments will show how the post-fusion learning rate affects the performance of the learning, as well as the parameters. To understand how the learning rate affects performance, we ran the normal T5 learning rate schedule with various offsets. (A sketch of an offset learning rate schedule follows this table.)
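
Because the paper specifies the FUSION operator only through mathematical equations (Eq. 5) and gives no pseudocode, the following is a minimal illustrative sketch of what fusing pre-trained weights into a wider initialization could look like. The block-diagonal structure, the function name fuse_dense_weights, and the layer shapes are assumptions made for illustration; this is not the paper's Eq. 5.

```python
# Illustrative sketch only: combine two pre-trained dense-layer weight
# matrices into a wider initialization by placing them on a block diagonal.
# The block structure, zero off-diagonal blocks, and function name are
# assumptions for illustration; they are NOT the paper's FUSION operator.
import numpy as np

def fuse_dense_weights(w_a: np.ndarray, w_b: np.ndarray) -> np.ndarray:
    """Place w_a and w_b on the diagonal of a larger matrix, zeros elsewhere."""
    fused = np.zeros((w_a.shape[0] + w_b.shape[0],
                      w_a.shape[1] + w_b.shape[1]), dtype=w_a.dtype)
    fused[:w_a.shape[0], :w_a.shape[1]] = w_a
    fused[w_a.shape[0]:, w_a.shape[1]:] = w_b
    return fused

# Example: fuse two small pre-trained layers into one wider layer.
rng = np.random.default_rng(0)
w_small_1 = rng.normal(size=(256, 256)).astype(np.float32)
w_small_2 = rng.normal(size=(256, 256)).astype(np.float32)
w_wide_init = fuse_dense_weights(w_small_1, w_small_2)  # shape (512, 512)
```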
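The C4 and GLUE datasets cited above are publicly available. The snippet below is not from the paper; it shows one common way to load them through TensorFlow Datasets, where "c4/en" and "glue/sst2" are TFDS identifiers rather than the authors' data pipeline.

```python
# Not from the paper: one common way to access the public datasets it uses
# (C4 for pre-training, GLUE for fine-tuning) via TensorFlow Datasets.
import tensorflow_datasets as tfds

# C4, English config. Note: preparing C4 locally is a very large
# download/processing job.
c4_train = tfds.load("c4/en", split="train", shuffle_files=True)

# A GLUE task (SST-2 here) with its standard TFDS splits.
sst2_train = tfds.load("glue/sst2", split="train")
sst2_validation = tfds.load("glue/sst2", split="validation")

# Inspect one fine-tuning example.
for example in sst2_train.take(1):
    print(example["sentence"], example["label"])
```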
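The experiment setup quote refers to running the normal T5 learning rate schedule with various offsets. The sketch below assumes T5's standard inverse-square-root schedule with a 10k-step warmup; the offset parameterization is only an illustration of that idea, not the paper's implementation.

```python
# Illustrative sketch of "the normal T5 learning rate schedule with various
# offsets": T5's inverse-square-root schedule, shifted by an offset so that
# post-fusion training resumes further along the decay curve. The exact
# offset parameterization is an assumption, not the paper's code.
import math

def t5_inverse_sqrt_lr(step: int, warmup_steps: int = 10_000, offset: int = 0) -> float:
    """Return the T5-style learning rate at a (possibly offset) step."""
    effective_step = step + offset
    return 1.0 / math.sqrt(max(effective_step, warmup_steps))

# Compare the learning rate at the same training step under a few offsets.
for offset in (0, 100_000, 500_000):
    print(offset, t5_inverse_sqrt_lr(step=50_000, offset=offset))
```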