Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Authors: Yubin Shi, Yixuan Chen, Mingzhi Dong, Xiaochen Yang, Dongsheng Li, Yujiang Wang, Robert Dick, Qin Lv, Yingying Zhao, Fan Yang, Tun Lu, Ning Gu, Li Shang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that MAT nearly halves the computational cost of model training while outperforming baselines in accuracy.
Researcher Affiliation | Collaboration | 1 Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China; 2 School of Mathematics and Statistics, University of Glasgow; 3 Microsoft Research Asia, Shanghai, China; 4 Department of Engineering Science, University of Oxford; 5 Department of Electrical Engineering and Computer Science, University of Michigan; 6 Department of Computer Science, University of Colorado Boulder; 7 School of Microelectronics, Fudan University
Pseudocode | Yes | Algorithm 1: Modular Adaptive Training (see the illustrative sketch after this table).
Open Source Code | No | The paper refers to the GitHub repositories of third-party tools it used (e.g., 'https://github.com/pnnl/torchntk' and 'https://github.com/microsoft/DeepSpeed'), but it does not provide a link to the authors' own source code for the methodology described in the paper.
Open Datasets | Yes | Following the basic setup of Liu et al. (2019), we train BERT from scratch on the masked language modeling (MLM) task on WikiText-2 (Merity et al., 2016). ... train Switch-Transformers using the vanilla, Multirate, and Switch-Rand training methods on WikiText-103 (Merity et al., 2016). ... We take the classic convolutional network VGG16 as an example, which is over-parameterized for the CIFAR-10 dataset.
Dataset Splits | No | The paper mentions 'validation loss' and 'validation perplexity' in the results, implying the use of a validation set. However, it does not provide explicit details about the specific training/validation/test split ratios, sample counts, or a citation to a predefined split used in their experiments for reproducibility.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions leveraging 'the implementation of Engel et al. (2022)' (torchntk) and measuring FLOPs using 'the DeepSpeed Profiler', but it does not specify version numbers for these or any other software dependencies (a profiler sketch follows the table).
Experiment Setup | Yes | All experiments are conducted on 8 NVIDIA GeForce RTX 3090 GPUs. For further experimental details, please refer to the Appendix. ... Table 4: Hyperparameters configuration in BERT and Switch-Transformer.
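
The Pseudocode row refers to the paper's Algorithm 1 (Modular Adaptive Training). Below is a minimal, hedged sketch of the general idea of module-adaptive training: each step, a per-module score decides which modules get updated. The score used here (mean squared gradient) is only a stand-in proxy, not the paper's modular-NTK eigenvalue criterion, and the paper's compute savings come from skipping backward computation for unselected modules, which this simplified version does not implement. Function names, the threshold, and the toy model are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

def modular_adaptive_step(modules, batch, loss_fn, optimizer, threshold):
    """One training step that skips parameter updates for low-scoring modules."""
    inputs, targets = batch

    # Forward pass: the model is treated here as a plain sequential stack of modules.
    x = inputs
    for m in modules:
        x = m(x)
    loss = loss_fn(x, targets)

    optimizer.zero_grad()
    loss.backward()

    # Score each module; modules below the threshold are frozen for this step.
    # The score below (mean squared gradient) is a stand-in proxy; the paper's
    # Algorithm 1 derives its criterion from the principal eigenvalue of the
    # modular neural tangent kernel (mNTK), which is not reproduced here.
    for m in modules:
        grads = [p.grad for p in m.parameters() if p.grad is not None]
        if not grads:
            continue
        score = torch.stack([g.pow(2).mean() for g in grads]).mean()
        if score < threshold:
            for p in m.parameters():
                p.grad = None  # PyTorch optimizers skip parameters with no gradient

    optimizer.step()
    return loss.item()

# Example usage with a toy two-module MLP (all shapes and values are illustrative).
modules = nn.ModuleList([nn.Sequential(nn.Linear(16, 32), nn.ReLU()),
                         nn.Linear(32, 4)])
optimizer = torch.optim.SGD(modules.parameters(), lr=0.1)
batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
modular_adaptive_step(modules, batch, nn.CrossEntropyLoss(), optimizer, threshold=1e-4)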
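
The Software Dependencies row notes that FLOPs were measured with the DeepSpeed Profiler without a stated version. The snippet below is a minimal sketch of how such a measurement can be taken with DeepSpeed's FLOPs profiler; the VGG16 model and CIFAR-10-sized input shape are assumptions chosen to mirror the paper's VGG16/CIFAR-10 experiment, and argument names or defaults may differ across DeepSpeed versions.

import torchvision.models as models
from deepspeed.profiling.flops_profiler import get_model_profile

# Profile a single forward pass of VGG16 on a CIFAR-10-sized input.
model = models.vgg16()
flops, macs, params = get_model_profile(
    model=model,
    input_shape=(1, 3, 32, 32),  # assumed batch of one CIFAR-10 image
    print_profile=True,
    detailed=False,
    as_string=True,
)
print(f"FLOPs: {flops}, MACs: {macs}, Params: {params}")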