Collage: Light-Weight Low-Precision Strategy for LLM Training

Authors: Tao Yu, Gaurav Gupta, Karthick Gopalswamy, Amith R Mamidala, Hao Zhou, Jeffrey Huynh, Youngsuk Park, Ron Diamant, Anoop Deoras, Luke Huan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that pre-training using COLLAGE removes the requirement of using 32-bit floating-point copies of the model and attains similar/better training performance compared to (16, 32)-bit mixed-precision strategy, with up to 3.7× speedup and 15% to 23% less memory usage in practice.
Researcher Affiliation | Collaboration | (1) Cornell University, Ithaca, NY; (2) AWS AI Labs, Santa Clara, CA; (3) AWS Annapurna Labs, Cupertino, CA; (4) AWS SageMaker, Santa Clara, CA; (5) AWS AI Research and Education, Santa Clara, CA.
Pseudocode | Yes | Algorithm 1 Grow; Algorithm 2 COLLAGE: BFloat16 MCF AdamW Optimization (a hedged Fast2Sum sketch follows the table).
Open Source Code | Yes | The code is available at https://github.com/amazon-science/collage.
Open Datasets | Yes | We first pre-train the BERT-base-uncased, BERT-large-uncased, and RoBERTa-base model with Hugging Face (HF) (Wolf et al., 2019) configuration on the Wikipedia-en corpus (Attardi, 2015), preprocessed with BERT WordPiece tokenizer.
Dataset Splits | Yes | We split the dataset into train/val/test with the split ratio 980 : 10 : 10 (a hedged loading/splitting sketch follows the table).
Hardware Specification | Yes | We use aws.p4.24xlarge compute instances for all of our experiments.
Software Dependencies | No | The paper mentions 'PyTorch' and 'Hugging Face' libraries (e.g., 'PyTorch BFloat16 Tensor', 'Hugging Face'), but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Additional training and hyperparameter details are described in Appendix E.2. Table 10. Pre-training hyperparameters used for BERT and RoBERTa. Table 11. Some configs and hyperparameters of GPT models and OpenLLaMA-7B.
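
The Grow routine cited in the Pseudocode row is described as a Fast2Sum-based building block for multi-component float (MCF) arithmetic. The following is a minimal sketch of the generic Fast2Sum error-free transformation and a toy bfloat16 accumulation that carries the rounding error as a second component; the function and variable names are illustrative and do not reproduce the paper's Algorithm 1 or the released code.

```python
import torch

def fast2sum(a: torch.Tensor, b: torch.Tensor):
    # Classic Fast2Sum error-free transformation, valid when |a| >= |b|:
    # s is the rounded sum and t the exact rounding error, so a + b == s + t.
    s = a + b
    t = b - (s - a)
    return s, t

# Toy MCF-style accumulation: keep a bfloat16 value together with a bfloat16
# error component so the information lost by each low-precision add is retained.
value = torch.ones(4, dtype=torch.bfloat16)
error = torch.zeros(4, dtype=torch.bfloat16)

for _ in range(100):
    update = torch.full((4,), 1e-3, dtype=torch.bfloat16)
    value, error = fast2sum(value, update + error)  # fold the old error back in

print(value, error)  # value + error stays close to the exact 1.1; a plain
                     # bfloat16 loop would remain stuck at 1.0
```

Tracking the rounding error as a separate low-precision component is the mechanism that lets a Collage-style optimizer state stay in bfloat16 rather than falling back to 32-bit master copies.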
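For the dataset and split rows, the sketch below shows one way to obtain a 980 : 10 : 10 train/val/test split of English Wikipedia with the Hugging Face datasets library. The dataset identifier ("wikipedia", "20220301.en"), tokenizer checkpoint, seed, and sequence length are assumptions for illustration; the paper's own preprocessing uses Attardi's wikiextractor corpus and may differ in detail.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed HF dump of English Wikipedia; not identical to the wikiextractor
# (Attardi, 2015) preprocessing used in the paper.
raw = load_dataset("wikipedia", "20220301.en", split="train")

# 980 : 10 : 10 split ratio reported in the paper, expressed as fractions.
splits = raw.train_test_split(test_size=20 / 1000, seed=42)         # 98% train
heldout = splits["test"].train_test_split(test_size=0.5, seed=42)   # 1% / 1%
train_ds, val_ds, test_ds = splits["train"], heldout["train"], heldout["test"]

# BERT WordPiece tokenizer, as quoted above; max_length is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train_tok = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=train_ds.column_names,
)
```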