Evolving Subnetwork Training for Large Language Models
Authors: Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, Kai Yu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply EST to train GPT2 model and TinyLlama model, resulting in 26.7% FLOPs saving for GPT2 and 25.0% for TinyLlama without an increase in loss on the pre-training dataset. Moreover, EST leads to performance improvements in downstream tasks, indicating that it benefits generalization. Additionally, we provide intuitive theoretical studies based on training dynamics and Dropout theory to ensure the feasibility of EST. |
| Researcher Affiliation | Collaboration | X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Shanghai, China; Suzhou Laboratory, Suzhou, China; AISpeech Co., Ltd., Suzhou, China. |
| Pseudocode | Yes | The pseudo-code of EST is as Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of its code for the described methodology. |
| Open Datasets | Yes | We conduct experiments with GPT2-base model... pretrained on OpenWebText dataset (Radford et al., 2019) from scratch. We pre-train a 1.1B TinyLlama model... on the subset of SlimPajama dataset (Soboleva et al., 2023) and StarCoder dataset (Li et al., 2023a) from scratch. |
| Dataset Splits | No | The paper mentions evaluating loss on a 'validation dataset' and shows loss curves for 'training and evaluation loss' (Fig 3, 4), but it does not specify exact percentages or sample counts for training/validation/test splits, nor does it cite predefined splits with specific details. |
| Hardware Specification | Yes | We use A100 80GB GPU to test both GPT2 model and TinyLlama model. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify any software packages, libraries, or programming languages with their version numbers. |
| Experiment Setup | Yes | The batch size is set to 512 and the sequence length is 1024. The total training step is 150k. For GPT2-base model, the practical sampling scheduler is set to S = (20k, 70k, 150k) and P = [(0.5, 0.5, 0.5), (0.5, 0.5, 1), (1, 1, 1)]. The initial learning rate is set to 6e-4, followed by a linear learning rate decay. For TinyLlama... the batch size is set to 1024 and the sequence length is 2048. The total training step is 60k... the max learning rate is set to 4e-4 with 2000 warm-up steps, followed by a cosine learning rate decay. (A hedged sketch of the sampling scheduler follows the table.) |
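
The following is a minimal sketch, assuming the stage boundaries S and per-stage ratio tuples P quoted in the Experiment Setup row are interpreted as a piecewise-constant schedule over training steps. The names `SamplingScheduler` and `ratios_at`, and the mapping of the three ratios to particular sub-modules of the Transformer, are illustrative assumptions here, not the authors' implementation; the paper's Algorithm 1 is the authoritative description of EST.

```python
# Hedged sketch (not the authors' code) of the staged sampling scheduler
# quoted for GPT2-base: S = (20k, 70k, 150k),
# P = [(0.5, 0.5, 0.5), (0.5, 0.5, 1), (1, 1, 1)].
# Which model dimension each ratio gates is an assumption; see Algorithm 1.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SamplingScheduler:
    """Piecewise-constant subnetwork sampling ratios over training steps."""
    boundaries: Tuple[int, ...]       # S: last training step of each stage
    ratios: List[Tuple[float, ...]]   # P: sampling ratios used in each stage

    def ratios_at(self, step: int) -> Tuple[float, ...]:
        # Return the ratio tuple of the first stage whose boundary
        # has not yet been reached.
        for boundary, stage_ratios in zip(self.boundaries, self.ratios):
            if step < boundary:
                return stage_ratios
        return self.ratios[-1]        # after the last boundary: full model


# Values quoted in the Experiment Setup row (GPT2-base, 150k total steps).
gpt2_scheduler = SamplingScheduler(
    boundaries=(20_000, 70_000, 150_000),
    ratios=[(0.5, 0.5, 0.5), (0.5, 0.5, 1.0), (1.0, 1.0, 1.0)],
)

assert gpt2_scheduler.ratios_at(10_000) == (0.5, 0.5, 0.5)    # stage 1
assert gpt2_scheduler.ratios_at(50_000) == (0.5, 0.5, 1.0)    # stage 2
assert gpt2_scheduler.ratios_at(120_000) == (1.0, 1.0, 1.0)   # stage 3: full model
```

In a training loop, a call like `ratios_at(step)` would determine how much of the model is activated at each step; the reported FLOPs savings (26.7% for GPT2, 25.0% for TinyLlama) stem from training on reduced subnetworks in the early stages before the schedule switches to the full model.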