Evolving Subnetwork Training for Large Language Models
Authors: Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, Kai Yu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply EST to train GPT2 model and TinyLlama model, resulting in 26.7% FLOPs saving for GPT2 and 25.0% for TinyLlama without an increase in loss on the pre-training dataset. Moreover, EST leads to performance improvements in downstream tasks, indicating that it benefits generalization. Additionally, we provide intuitive theoretical studies based on training dynamics and Dropout theory to ensure the feasibility of EST. |
| Researcher Affiliation | Collaboration | X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Shanghai, China; Suzhou Laboratory, Suzhou, China; AISpeech Co., Ltd., Suzhou, China. |
| Pseudocode | Yes | The pseudo-code of EST is as Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of its code for the described methodology. |
| Open Datasets | Yes | We conduct experiments with GPT2-base model... pretrained on OpenWebText dataset (Radford et al., 2019) from scratch. We pre-train a 1.1B TinyLlama model... on the subset of SlimPajama dataset (Soboleva et al., 2023) and StarCoder dataset (Li et al., 2023a) from scratch. |
| Dataset Splits | No | The paper mentions evaluating loss on a 'validation dataset' and shows loss curves for 'training and evaluation loss' (Fig 3, 4), but it does not specify exact percentages or sample counts for training/validation/test splits, nor does it cite predefined splits with specific details. |
| Hardware Specification | Yes | We use A100 80GB GPU to test both GPT2 model and TinyLlama model. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify any software packages, libraries, or programming languages with their version numbers. |
| Experiment Setup | Yes | The batch size is set to 512 and the sequence length is 1024. The total training step is 150k. For GPT2-base model, the practical sampling scheduler is set to S = (20k, 70k, 150k) and P = [(0.5, 0.5, 0.5), (0.5, 0.5, 1), (1, 1, 1)]. The initial learning rate is set to 6e-4, followed by a linear learning rate decay. For TinyLlama... the batch size is set to 1024 and the sequence length is 2048. The total training step is 60k... the max learning rate is set to 4e-4 with 2000 warm-up steps, followed by a cosine learning rate decay. (A hedged sketch of the sampling scheduler follows the table.) |
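
The following is a minimal sketch, assuming the stage boundaries S and per-stage ratio tuples P quoted in the Experiment Setup row are interpreted as a piecewise-constant schedule over training steps. The names `SamplingScheduler` and `ratios_at`, and the mapping of the three ratios to particular sub-modules of the Transformer, are illustrative assumptions here, not the authors' implementation; the paper's Algorithm 1 is the authoritative description of EST.

```python
# Hedged sketch (not the authors' code) of the staged sampling scheduler
# quoted for GPT2-base: S = (20k, 70k, 150k),
# P = [(0.5, 0.5, 0.5), (0.5, 0.5, 1), (1, 1, 1)].
# Which model dimension each ratio gates is an assumption; see Algorithm 1.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SamplingScheduler:
    """Piecewise-constant subnetwork sampling ratios over training steps."""
    boundaries: Tuple[int, ...]       # S: last training step of each stage
    ratios: List[Tuple[float, ...]]   # P: sampling ratios used in each stage

    def ratios_at(self, step: int) -> Tuple[float, ...]:
        # Return the ratio tuple of the first stage whose boundary
        # has not yet been reached.
        for boundary, stage_ratios in zip(self.boundaries, self.ratios):
            if step < boundary:
                return stage_ratios
        return self.ratios[-1]        # after the last boundary: full model


# Values quoted in the Experiment Setup row (GPT2-base, 150k total steps).
gpt2_scheduler = SamplingScheduler(
    boundaries=(20_000, 70_000, 150_000),
    ratios=[(0.5, 0.5, 0.5), (0.5, 0.5, 1.0), (1.0, 1.0, 1.0)],
)

assert gpt2_scheduler.ratios_at(10_000) == (0.5, 0.5, 0.5)    # stage 1
assert gpt2_scheduler.ratios_at(50_000) == (0.5, 0.5, 1.0)    # stage 2
assert gpt2_scheduler.ratios_at(120_000) == (1.0, 1.0, 1.0)   # stage 3: full model
```

In a training loop, a call like `ratios_at(step)` would determine how much of the model is activated at each step; the reported FLOPs savings (26.7% for GPT2, 25.0% for TinyLlama) stem from training on reduced subnetworks in the early stages before the schedule switches to the full model.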