SLAMB: Accelerated Large Batch Training with Sparse Communication
Authors: Hang Xu, Wenxuan Zhang, Jiawei Fei, Yuzhe Wu, Tingwen Xie, Jun Huang, Yuchen Xie, Mohamed Elhoseiny, Panos Kalnis
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that, compared to the state-of-the-art, SLAMB transmits half the amount of data in large-batch BERT pretraining, without sacrificing accuracy. Moreover, SLAMB achieves excellent scalability in large computing infrastructures. |
| Researcher Affiliation | Collaboration | 1King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia 2Meituan, Beijing, China. |
| Pseudocode | Yes | Algorithm 1 SLAMB (an illustrative sketch of the communication pattern follows the table) |
| Open Source Code | Yes | Code is available at https://github.com/hangxu0304/SLAMB |
| Open Datasets | Yes | We run the BERT-Large pre-training task on the dataset introduced by (Devlin et al., 2018), which is a concatenation of Wikipedia and Books Corpus with 2.5B and 800M words, respectively. ... We run image classification task on ImageNet (Deng et al., 2009) dataset by using Swin Transformer Base model (Swin-B) (Liu et al., 2021) (88M parameters). ... We use the F1 score of SQuAD 1.1 fine-tuning task as the accuracy metric to evaluate the pre-trained models. (Footnote 3: https://rajpurkar.github.io/SQuAD-explorer/) |
| Dataset Splits | No | The paper mentions 'BERT validation loss' in Table 2, implying a validation set was used, but it does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We use two different types of clusters in our experiment: (i) a V100 cluster on Amazon EC2 cloud, where each machine is equipped with 4 NVIDIA V100 GPUs (16GB memory each) and 10Gbps network (i.e., P3.16xlarge instance), with an optional upgrade to 8 GPUs per node and 100Gbps network; and (ii) an A100 cluster, where each machine has 8 NVIDIA A100 GPUs (80GB memory each) and 100Gbps Infiniband network. For both types of clusters, the GPUs within each node are connected with high speed NVLink. |
| Software Dependencies | No | The paper mentions using 'mixed-precision training' and that 'All optimizers tested here use the same level of implementation (native PyTorch API)'. However, it does not specify version numbers for PyTorch, CUDA, or other software libraries, making it difficult to reproduce the exact software environment. |
| Experiment Setup | Yes | For SLAMB, we use the following hyper-parameter settings: β1 = 0.9, β2 = 0.999, β3 = 0.93, compression ratio k = 0.1, model synchronization interval H = 100. ... Detailed hyper-parameter settings are shown in Appendix G.1. For phase 1 with seqlen 128, we set the learning rate (LR) to 6e-3, and use a linear warm-up and polynomial decay (degree=0.5) LR scheduler. The LR warm-up steps and total steps are 2000 and 7038, respectively. (A sketch of this schedule follows the table.) |
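
The pseudocode row above points to Algorithm 1 (SLAMB); the authoritative version is in the paper and the linked repository. Purely as an illustration of the communication pattern such sparse-communication optimizers follow (top-k gradient compression with error feedback, plus a full model synchronization every H steps, with k = 0.1 and H = 100 in the quoted settings), a minimal sketch might look as follows. The helper names and the plain SGD-style local update are assumptions made for brevity; SLAMB itself builds on LAMB's layer-wise trust-ratio update.

```python
# Hedged sketch of a top-k sparsified update with error feedback and periodic
# model synchronization. This is NOT the authors' Algorithm 1; function names
# and the SGD-style update are illustrative assumptions.
import torch
import torch.distributed as dist


def topk_sparsify(tensor: torch.Tensor, ratio: float) -> torch.Tensor:
    """Keep the largest-magnitude `ratio` fraction of entries, zero the rest."""
    flat = tensor.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    return torch.where(mask, flat, torch.zeros_like(flat)).view_as(tensor)


def sparse_step(param, grad, error, step, k_ratio=0.1, sync_interval=100, lr=6e-3):
    """One illustrative update: compress the gradient, keep the residual locally
    (error feedback), and fully synchronize the model every `sync_interval`
    steps (H = 100 in the quoted setting)."""
    corrected = grad + error                      # add accumulated residual
    compressed = topk_sparsify(corrected, k_ratio)
    error.copy_(corrected - compressed)           # store what was not sent
    if dist.is_initialized():
        # A real implementation would communicate only the non-zero values and
        # indices; a dense all_reduce of the masked tensor is used for brevity.
        dist.all_reduce(compressed)
        compressed /= dist.get_world_size()
    # Plain SGD-style update for illustration; SLAMB applies LAMB's
    # layer-wise trust-ratio scaling here instead.
    param.data.add_(compressed, alpha=-lr)
    if dist.is_initialized() and step % sync_interval == 0:
        dist.all_reduce(param.data)               # periodic full model sync
        param.data /= dist.get_world_size()
```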
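
The experiment-setup row quotes a linear warm-up, polynomial-decay (degree 0.5) learning-rate schedule with peak LR 6e-3, 2000 warm-up steps, and 7038 total steps for phase 1. A minimal re-implementation of that schedule, assuming the standard warm-up/decay formulation rather than the authors' exact code, is:

```python
# Hedged sketch of the quoted phase-1 LR schedule: linear warm-up to the peak
# LR, then polynomial decay of degree 0.5 down to zero at the final step.
def lr_at_step(step, peak_lr=6e-3, warmup_steps=2000, total_steps=7038, degree=0.5):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1.0 - min(progress, 1.0)) ** degree  # polynomial decay
```

To drive a PyTorch optimizer with it, one can wrap the function in `torch.optim.lr_scheduler.LambdaLR` with the multiplier `lambda s: lr_at_step(s) / 6e-3` and the optimizer's base LR set to 6e-3.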