Collage: Light-Weight Low-Precision Strategy for LLM Training
Authors: Tao Yu, Gaurav Gupta, Karthick Gopalswamy, Amith R Mamidala, Hao Zhou, Jeffrey Huynh, Youngsuk Park, Ron Diamant, Anoop Deoras, Luke Huan
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that pre-training using COLLAGE removes the requirement of using 32-bit floating-point copies of the model and attains similar/better training performance compared to the (16, 32)-bit mixed-precision strategy, with up to 3.7× speedup and 15% to 23% less memory usage in practice. |
| Researcher Affiliation | Collaboration | 1Cornell University, Ithaca, NY 2AWS AI Labs, Santa Clara, CA 3AWS Annapurna Labs, Cupertino, CA 4AWS Sagemaker, Santa Clara, CA 5AWS AI Research and Education, Santa Clara, CA. |
| Pseudocode | Yes | Algorithm 1 Grow; Algorithm 2 COLLAGE: BFloat16 MCF AdamW Optimization (a hedged sketch of the Grow step appears after this table). |
| Open Source Code | Yes | The code is available at https://github.com/amazon-science/collage. |
| Open Datasets | Yes | We first pre-train the BERT-base-uncased, BERT-large-uncased, and RoBERTa-base models with the Hugging Face (HF) (Wolf et al., 2019) configuration on the Wikipedia-en corpus (Attardi, 2015), preprocessed with the BERT WordPiece tokenizer. |
| Dataset Splits | Yes | We split the dataset into train/val/test with the split ratio 980 : 10 : 10. |
| Hardware Specification | Yes | We use aws.p4.24xlarge compute instances for all of our experiments. |
| Software Dependencies | No | The paper mentions the 'PyTorch' and 'Hugging Face' libraries (e.g., 'PyTorch BFloat16 Tensor', 'Hugging Face'), but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Additional training and hyperparameter details are described in Appendix E.2. Table 10. Pre-training hyperparameters used for BERT and RoBERTa. Table 11. Some configs and hyper-parameters of GPT models and OpenLLaMA-7B. |
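
The pseudocode row cites Algorithm 1 (Grow), the building block COLLAGE uses for multi-component float (MCF) arithmetic in BFloat16. Below is a minimal illustrative sketch, assuming Grow follows the standard Fast2Sum error-free transformation; the function name `grow` and the toy (value, error) pair are assumptions for illustration, not the authors' implementation.

```python
import torch

def grow(a: torch.Tensor, b: torch.Tensor):
    """Fast2Sum sketch: return (s, e) where s = fl(a + b) and e is the
    rounding error, assuming |a| >= |b| elementwise. Both components stay
    in the working precision (here BFloat16), so no fp32 copy is needed."""
    s = a + b            # rounded sum in bfloat16
    e = b - (s - a)      # exact rounding error recovered in bfloat16
    return s, e

# Toy usage: carry a (value, error) pair, emulating an MCF weight during
# an optimizer step. A plain bfloat16 add would silently drop the small
# update (1e-3 is below the rounding threshold near 1.0); the error
# component retains it for later steps.
value = torch.tensor([1.0], dtype=torch.bfloat16)
error = torch.tensor([0.0], dtype=torch.bfloat16)
update = torch.tensor([1e-3], dtype=torch.bfloat16)

value, error = grow(value, update + error)  # fold previous error back in
print(value, error)
```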