Collage: Light-Weight Low-Precision Strategy for LLM Training

Authors: Tao Yu, Gaurav Gupta, Karthick Gopalswamy, Amith R Mamidala, Hao Zhou, Jeffrey Huynh, Youngsuk Park, Ron Diamant, Anoop Deoras, Luke Huan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that pre-training using COLLAGE removes the requirement of using 32-bit floating-point copies of the model and attains similar/better training performance compared to (16, 32)-bit mixed-precision strategy, with up to 3.7× speedup and 15% to 23% less memory usage in practice.
Researcher Affiliation | Collaboration | (1) Cornell University, Ithaca, NY; (2) AWS AI Labs, Santa Clara, CA; (3) AWS Annapurna Labs, Cupertino, CA; (4) AWS SageMaker, Santa Clara, CA; (5) AWS AI Research and Education, Santa Clara, CA.
Pseudocode | Yes | Algorithm 1 Grow; Algorithm 2 COLLAGE: BFloat16 MCF AdamW Optimization (a hedged Fast2Sum sketch follows the table).
Open Source Code | Yes | The code is available at https://github.com/amazon-science/collage.
Open Datasets | Yes | We first pre-train the BERT-base-uncased, BERT-large-uncased, and RoBERTa-base model with Hugging Face (HF) (Wolf et al., 2019) configuration on the Wikipedia-en corpus (Attardi, 2015), preprocessed with BERT WordPiece tokenizer.
Dataset Splits | Yes | We split the dataset into train/val/test with the split ratio 980 : 10 : 10 (a hedged loading/splitting sketch follows the table).
Hardware Specification | Yes | We use aws.p4.24xlarge compute instances for all of our experiments.
Software Dependencies | No | The paper mentions 'PyTorch' and 'Hugging Face' libraries (e.g., 'PyTorch BFloat16 Tensor', 'Hugging Face'), but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Additional training and hyperparameter details are described in Appendix E.2. Table 10. Pre-training hyperparameters used for BERT and RoBERTa. Table 11. Some configs and hyperparameters of GPT models and OpenLLaMA-7B.
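
The Grow routine cited in the Pseudocode row is described as a Fast2Sum-based building block for multi-component float (MCF) arithmetic. The following is a minimal sketch of the generic Fast2Sum error-free transformation and a toy bfloat16 accumulation that carries the rounding error as a second component; the function and variable names are illustrative and do not reproduce the paper's Algorithm 1 or the released code.

```python
import torch

def fast2sum(a: torch.Tensor, b: torch.Tensor):
    # Classic Fast2Sum error-free transformation, valid when |a| >= |b|:
    # s is the rounded sum and t the exact rounding error, so a + b == s + t.
    s = a + b
    t = b - (s - a)
    return s, t

# Toy MCF-style accumulation: keep a bfloat16 value together with a bfloat16
# error component so the information lost by each low-precision add is retained.
value = torch.ones(4, dtype=torch.bfloat16)
error = torch.zeros(4, dtype=torch.bfloat16)

for _ in range(100):
    update = torch.full((4,), 1e-3, dtype=torch.bfloat16)
    value, error = fast2sum(value, update + error)  # fold the old error back in

print(value, error)  # value + error stays close to the exact 1.1; a plain
                     # bfloat16 loop would remain stuck at 1.0
```

Tracking the rounding error as a separate low-precision component is the mechanism that lets a Collage-style optimizer state stay in bfloat16 rather than falling back to 32-bit master copies.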
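For the dataset and split rows, the sketch below shows one way to obtain a 980 : 10 : 10 train/val/test split of English Wikipedia with the Hugging Face datasets library. The dataset identifier ("wikipedia", "20220301.en"), tokenizer checkpoint, seed, and sequence length are assumptions for illustration; the paper's own preprocessing uses Attardi's wikiextractor corpus and may differ in detail.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed HF dump of English Wikipedia; not identical to the wikiextractor
# (Attardi, 2015) preprocessing used in the paper.
raw = load_dataset("wikipedia", "20220301.en", split="train")

# 980 : 10 : 10 split ratio reported in the paper, expressed as fractions.
splits = raw.train_test_split(test_size=20 / 1000, seed=42)         # 98% train
heldout = splits["test"].train_test_split(test_size=0.5, seed=42)   # 1% / 1%
train_ds, val_ds, test_ds = splits["train"], heldout["train"], heldout["test"]

# BERT WordPiece tokenizer, as quoted above; max_length is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train_tok = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=train_ds.column_names,
)
```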