Masked Structural Growth for 2x Faster Language Model Pre-training
Authors: Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performances. All the experiments are conducted on a single pod with 8 Nvidia A100 GPUs. We present Bert-large results in Table 4; Bert-base and GPT-2 results are in Table 5. |
| Researcher Affiliation | Collaboration | 1Beijing Academy of Artificial Intelligence, Beijing, China 2Harbin Institute of Technology, Shenzhen, China |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. The paper describes its method using textual descriptions and mathematical equations. |
| Open Source Code | Yes | Code is publicly available at https://github.com/cofe-ai/MSG. |
| Open Datasets | Yes | We train both Bert-base and Bert-large (Devlin et al., 2019) using a pre-processed combination of English Wikipedia and Book Corpus (Zhu et al., 2015), and GPT-2 (Radford et al., 2019) using Open Web Text (Gokaslan & Cohen, 2019). As for evaluation, we fine-tune our Bert models on the GLUE (Wang et al., 2018) and SQuAD v1.1 (Rajpurkar et al., 2018) tasks, and GPT models on Wikitext2 (Merity et al., 2016). (A hedged data-loading sketch appears after this table.) |
| Dataset Splits | Yes | We report the mean and standard deviation of the metrics across 3 runs on the dev set. For GPT-2, we evaluate on the validation set of Wikitext2 (Merity et al., 2016). We report the zero-shot and fine-tuned perplexities on the validation set. |
| Hardware Specification | Yes | All the experiments are conducted on a single pod with 8 Nvidia A100 GPUs. The experiments are conducted on 2 pods with 14 Nvidia A100 GPUs. A GPT-like LLM is trained with 24 DGX-A800 servers (8 × 80G GPUs each). |
| Software Dependencies | No | The paper mentions optimizers like AdamW and refers to other models (Bert, GPT-2) and implementations (NanoGPT) but does not list specific software dependencies with version numbers (e.g., 'PyTorch 1.9' or 'TensorFlow 2.x'). |
| Experiment Setup | Yes | We use a learning rate of (1e-4, 1e-5) and warm-up steps of (10k, 30k) for (Bert-base, Bert-large), respectively, with a linear learning rate scheduler for both. The learning rate is reset to its maximum value at the 1st and 2nd growth stage, following (Gu et al., 2021). The batch size is set to 256 and maximum sequence length to 128. We clip the gradient norm to 1.0 for Bert-large. For GPT-2, we train with a sequence length of 1024 and a batch size of 140 for 240k steps. For all the GLUE tasks, we use a batch size of 32, sequence length of 128 and learning rate of 2e-5. We fine-tune for 5 epochs for small datasets... and 3 epochs for other tasks. For SQuAD, we fine-tune with a batch size of 12 and learning rate of 3e-5 for 2 epochs. (A hedged schedule sketch appears after this table.) |
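The corpora and evaluation sets quoted in the Open Datasets row are all publicly distributed. The sketch below shows one way to fetch them through the Hugging Face `datasets` hub; the identifiers and configs (`wikipedia/20220301.en`, `bookcorpus`, `openwebtext`, `glue/sst2`, `squad`, `wikitext-2-raw-v1`) are common mirrors chosen for illustration, not the paper's exact pre-processed combination, and may require different names or flags depending on the library version.

```python
# Hedged sketch: fetching the corpora named in the paper via Hugging Face
# `datasets`.  Identifiers are illustrative mirrors, not the authors' exact
# pre-processed data; newer library versions may need trust_remote_code=True
# for the script-based sets (wikipedia, bookcorpus, openwebtext).
from datasets import load_dataset

wiki      = load_dataset("wikipedia", "20220301.en", split="train")  # English Wikipedia
books     = load_dataset("bookcorpus", split="train")                # Book Corpus
owt       = load_dataset("openwebtext", split="train")               # Open Web Text (GPT-2 corpus)
glue_sst2 = load_dataset("glue", "sst2")                             # one of the GLUE tasks
squad     = load_dataset("squad")                                    # SQuAD v1.1
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")            # Wikitext2 (GPT-2 evaluation)

# Dev/validation splits used for evaluation: mean and std over 3 fine-tuning
# runs are reported for GLUE; zero-shot and fine-tuned perplexity for Wikitext2.
print(glue_sst2["validation"], wikitext2["validation"])
```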
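For the Experiment Setup row, the quoted Bert-base schedule can be approximated with standard PyTorch and `transformers` utilities. The sketch below is a minimal illustration of AdamW at 1e-4, a 10k-step linear warm-up with linear decay, gradient-norm clipping at 1.0, and the learning-rate reset at growth stages (Gu et al., 2021). The total step count, the growth-stage step indices, and the tiny demo batch are assumptions, and the reset is modeled simply by rebuilding the scheduler; this is not the authors' MSG implementation (see their repository for that).

```python
# Minimal sketch of the quoted Bert-base pre-training schedule: AdamW at 1e-4,
# 10k-step linear warm-up, gradient norm clipped to 1.0, learning rate reset
# at growth stages.  TOTAL_STEPS, GROWTH_RESETS, and the demo batch size are
# illustrative assumptions, not values confirmed by the paper.
import torch
from transformers import BertConfig, BertForMaskedLM, get_linear_schedule_with_warmup

TOTAL_STEPS = 100_000            # assumption: not stated in the quoted setup
WARMUP_STEPS = 10_000            # Bert-base warm-up from the quoted setup
GROWTH_RESETS = {30_000, 60_000} # hypothetical 1st/2nd growth-stage steps

config = BertConfig()            # stand-in; MSG grows the model over stages
model = BertForMaskedLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)

def training_step(batch_ids: torch.Tensor, step: int) -> float:
    """One optimizer step; batch_ids is a (batch, seq_len) tensor of token ids."""
    global scheduler
    if step in GROWTH_RESETS:
        # Reset the learning rate to its maximum at a growth stage (following
        # Gu et al., 2021), modeled here by rebuilding the decay schedule.
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=0, num_training_steps=TOTAL_STEPS - step
        )
    loss = model(input_ids=batch_ids, labels=batch_ids).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip to 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()

# Paper values: batch size 256, sequence length 128; a tiny random batch here.
demo_batch = torch.randint(0, config.vocab_size, (2, 128))
print(training_step(demo_batch, step=0))
```

For the GPT-2 runs, the same loop would be used with a 1024-token context, batch size 140, and 240k training steps, as quoted above.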