Go Wide, Then Narrow: Efficient Training of Deep Thin Networks

Authors: Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan Song, Quoc Le, Qiang Liu, Dale Schuurmans

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also conduct large-scale empirical experiments to validate the proposed method. By training with our method, ResNet50 can outperform ResNet101, and BERTBASE can be comparable with BERTLARGE, when ResNet101 and BERTLARGE are trained under the standard training procedures as in the literature.
Researcher Affiliation | Collaboration | (1) Google Brain, USA; (2) Department of Computer Science, The University of Texas at Austin, USA.
Pseudocode | Yes | We summarize our method in Algorithm 1. Here are several additional remarks. First, our method does not have to be more expensive than the normal knowledge distillation method since the imitation training in our method is quite light in each layer. In addition, we often can use an existing trained big model as the wide network in our method. Consequently, the wide learning stage of our method can be skipped. Moreover, the layers in our method do not have to exactly align to the layers of the trained neural models. For example, the wide network may be a 24-layer BERTLARGE ...
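
To make the quoted description concrete, below is a minimal PyTorch sketch of a layer-wise imitation stage of the kind Algorithm 1 summarizes: each thin layer is trained to reproduce the hidden state of the corresponding wide layer. The module lists `wide_layers` and `thin_layers`, the per-layer `projections` used to match hidden widths, and the hyperparameters are illustrative assumptions, not the paper's exact construction.

```python
# Sketch of layer-wise imitation: train each thin layer to mimic the
# corresponding wide layer's output. All names here are illustrative.
import torch
import torch.nn as nn

def layerwise_imitation(wide_layers, thin_layers, projections, data_loader,
                        lr=1e-3, steps_per_layer=1000):
    """Train the k-th thin layer (plus its width-matching projection) to
    reproduce the hidden state after the first k+1 wide layers."""
    mse = nn.MSELoss()
    for k in range(len(thin_layers)):
        # Only the k-th thin layer and its projection are updated here;
        # earlier thin layers have already been trained and stay fixed.
        params = list(thin_layers[k].parameters()) + list(projections[k].parameters())
        opt = torch.optim.Adam(params, lr=lr)
        step = 0
        for x, _ in data_loader:
            with torch.no_grad():
                # Imitation target: hidden state of the trained wide network.
                h_wide = x
                for j in range(k + 1):
                    h_wide = wide_layers[j](h_wide)
                # Input to the k-th thin layer: output of the frozen thin prefix.
                h_thin = x
                for j in range(k):
                    h_thin = thin_layers[j](h_thin)
            out = projections[k](thin_layers[k](h_thin))  # project to the wide width
            loss = mse(out, h_wide)
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= steps_per_layer:
                break
```

As the quoted remarks note, an existing trained big model can serve as the wide network, in which case the wide-learning stage is skipped and only an imitation stage of this kind needs to be run.
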
Open Source Code | No | No explicit statement or link for open-source code for the described methodology was found.
Open Datasets | Yes | We train the widely used ResNet models (He et al., 2016) on the ImageNet dataset (Russakovsky et al., 2015) using our approach and baseline methods. Following Devlin et al. (2019), we first pre-train the BERT model using the BooksCorpus (Zhu et al., 2015) and the Wikipedia corpus. Then we fine-tune this pre-trained model and evaluate on the Stanford Question Answering Dataset (SQuAD) 1.1 and 2.0 (Rajpurkar et al., 2016).
Dataset Splits | Yes | We evaluate the models using the SQuAD 1.1 and 2.0 datasets. Results are shown in Table 4. Note that BERTBASE trained using our vanilla setting here outperforms BERTBASE (Devlin et al., 2019) by a large margin.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instance types) were mentioned for running the experiments.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were mentioned.
Experiment Setup | Yes | We follow the training settings in He et al. (2016). Each ResNet variant is trained for 90 epochs using SGD with momentum 0.9, batch norm decay 0.9, weight decay 1e-4, and batch size 256. The learning rate is linearly increased from 0 to 0.1 over the first 5 epochs, and then reduced by 10x at epochs 30, 60, and 80. In the pre-training phase, we train the model on the masked language modeling (MLM) and next sentence prediction (NSP) tasks using the BooksCorpus and Wikipedia corpus for 1 million steps with a batch size of 512 and a sequence length of 512. We use the Adam optimizer with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, and weight decay of 0.01. The learning rate is linearly warmed up over the first 10,000 steps, and then linearly decayed.
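
As a concrete illustration of the ResNet schedule quoted above, here is a short PyTorch sketch of SGD with momentum 0.9 and weight decay 1e-4, a linear warmup toward 0.1 over the first 5 epochs, and 10x drops at epochs 30, 60, and 80. The placeholder model, the per-epoch stepping, and the warmup granularity are assumptions of the sketch, not details taken from the paper.

```python
import torch

# Placeholder model; in the experiments this would be a ResNet variant.
model = torch.nn.Linear(10, 10)

base_lr = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=1e-4)

def lr_factor(epoch):
    """Multiplier on base_lr, stepped once per epoch (an assumption of this sketch)."""
    if epoch < 5:            # linear warmup toward base_lr over the first 5 epochs
        return (epoch + 1) / 5
    if epoch < 30:
        return 1.0
    if epoch < 60:
        return 0.1
    if epoch < 80:
        return 0.01
    return 0.001             # after the final drop at epoch 80

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(90):
    # train_one_epoch(model, optimizer)  # hypothetical training step, omitted here
    scheduler.step()
```

The BERT pre-training schedule quoted above (10,000-step linear warmup followed by linear decay over 1 million steps) can be expressed the same way with a per-step LambdaLR; Adam with a decoupled weight decay of 0.01 is commonly implemented as AdamW.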