Preparing Lessons for Progressive Training on Language Models

Authors: Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, Qun Liu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiment: In this section, we conduct experiments to validate the performance of our proposed method."
Researcher Affiliation | Collaboration | 1 Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong, China; 2 Pengcheng Laboratory, Shenzhen, China; 3 School of Computer Science, Peking University, Beijing, China; 4 Peking University-Anker Embodied AI Lab; 5 Huawei Noah's Ark Lab, Shenzhen, Guangdong, China; 6 Cloud BU, Huawei Technologies
Pseudocode | Yes | "Algorithm 1: Process of Apollo"
Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor does it include a link to a code repository.
Open Datasets | Yes | "The training dataset is a concatenation of English Wikipedia and Toronto Book Corpus (Zhu et al. 2015)." (See the corpus-loading sketch after the table.)
Dataset Splits | No | The paper mentions using "500 data samples for validation" for one specific analysis (the experiment on the expanding method), but it does not give explicit train/validation/test splits (percentages or exact counts) for the main BERT and GPT training runs, relying instead on standard benchmarks without detailing their splits.
Hardware Specification | No | The paper does not specify any hardware details, such as CPU or GPU models or memory, used for the experiments.
Software Dependencies | No | The paper mentions using "AdamW as the optimizer" but does not specify version numbers for AdamW or for any other software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | "We use AdamW as the optimizer with a learning rate of 10^-4 and weight decay of 10^-2 in all the experiments. We chose the training batch sizes of 768 and 512 for BERT (Devlin et al. 2019) and GPT (Radford et al. 2019) models, respectively. ... Layer numbers of Apollo are [1, 3, 6, 12] and change at epochs [2, 4, 10]. LiGO is warmly trained for 100 steps as claimed in the original paper (Wang et al. 2023b). The training epoch is 35." (See the training-loop sketch after the table.)
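
The corpus named in the Open Datasets row can be approximated from public releases of both sources. Below is a minimal sketch assuming the Hugging Face datasets library; the paper does not name its data tooling, so the snapshot identifiers "wikipedia"/"20220301.en" and "bookcorpus" are stand-ins, not the authors' pipeline.

    # Hedged sketch: assemble a Wikipedia + BookCorpus training corpus.
    # The snapshot names below are assumptions; the paper does not say
    # which dumps or loaders were actually used.
    from datasets import load_dataset, concatenate_datasets

    wiki = load_dataset("wikipedia", "20220301.en", split="train")
    books = load_dataset("bookcorpus", split="train")

    # Keep only the shared "text" column so the two schemas align.
    wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

    corpus = concatenate_datasets([wiki, books])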
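
The Experiment Setup row fixes the optimizer hyperparameters and Apollo's depth schedule. The PyTorch sketch below wires those reported numbers together. It is a sketch under stated assumptions, not the paper's method: the expansion rule (cyclically copying trained layers) is a generic progressive-stacking stand-in for Apollo's actual operator, and the one-layer starting encoder is a hypothetical placeholder for the real BERT stack.

    # Hedged sketch of the reported setup: AdamW (lr 1e-4, weight decay 1e-2)
    # and a depth schedule growing the model from 1 to 12 layers.
    # The expansion rule here (duplicate existing layers) is an assumption,
    # not the Apollo operator described in the paper.
    import copy
    import torch
    from torch import nn

    DEPTHS = [1, 3, 6, 12]      # layer counts reported for Apollo
    SWITCH_EPOCHS = [2, 4, 10]  # epochs at which the depth changes
    TOTAL_EPOCHS = 35

    def depth_for_epoch(epoch: int) -> int:
        """Return the scheduled layer count for a given epoch."""
        stage = sum(epoch >= e for e in SWITCH_EPOCHS)
        return DEPTHS[stage]

    def expand_layers(layers: nn.ModuleList, new_depth: int) -> nn.ModuleList:
        """Grow the encoder by cyclically copying trained layers (assumed rule)."""
        grown = list(layers)
        while len(grown) < new_depth:
            grown.append(copy.deepcopy(grown[len(grown) % len(layers)]))
        return nn.ModuleList(grown)

    # Hypothetical 1-layer starting encoder standing in for the BERT stack.
    layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=768, nhead=12)])
    optimizer = torch.optim.AdamW(layers.parameters(), lr=1e-4, weight_decay=1e-2)

    for epoch in range(TOTAL_EPOCHS):
        target = depth_for_epoch(epoch)
        if target > len(layers):
            layers = expand_layers(layers, target)
            # Re-create the optimizer so the new parameters are tracked.
            optimizer = torch.optim.AdamW(layers.parameters(),
                                          lr=1e-4, weight_decay=1e-2)
        # ... one training epoch over Wikipedia + BookCorpus batches
        #     (batch size 768 for BERT, 512 for GPT, per the paper) ...

Rebuilding the optimizer after each expansion is one simple way to register the new parameters; the paper does not describe how optimizer state is carried across expansions, so that choice is also an assumption.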