SLAMB: Accelerated Large Batch Training with Sparse Communication

Authors: Hang Xu, Wenxuan Zhang, Jiawei Fei, Yuzhe Wu, Tingwen Xie, Jun Huang, Yuchen Xie, Mohamed Elhoseiny, Panos Kalnis

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results show that, compared to the state-of-the-art, SLAMB transmits half the amount of data in large-batch BERT pretraining, without sacrificing accuracy. Moreover, SLAMB achieves excellent scalability in large computing infrastructures.
Researcher Affiliation | Collaboration | (1) King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia; (2) Meituan, Beijing, China.
Pseudocode | Yes | Algorithm 1 SLAMB (a generic sketch of the sparse-communication idea appears after the table)
Open Source Code | Yes | Code is available at https://github.com/hangxu0304/SLAMB
Open Datasets | Yes | We run the BERT-Large pre-training task on the dataset introduced by (Devlin et al., 2018), which is a concatenation of Wikipedia and Books Corpus with 2.5B and 800M words, respectively. ... We run the image classification task on the ImageNet (Deng et al., 2009) dataset using the Swin Transformer Base model (Swin-B) (Liu et al., 2021) (88M parameters). ... We use the F1 score of the SQuAD 1.1 fine-tuning task as the accuracy metric to evaluate the pre-trained models. (Footnote: https://rajpurkar.github.io/SQuAD-explorer/)
Dataset Splits | No | The paper mentions 'BERT validation loss' in Table 2, implying a validation set was used, but it does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages, sample counts, or explicit splitting methodology).
Hardware Specification | Yes | We use two different types of clusters in our experiment: (i) a V100 cluster on Amazon EC2 cloud, where each machine is equipped with 4 NVIDIA V100 GPUs (16GB memory each) and 10Gbps network (i.e., P3.16xlarge instance), with an optional upgrade to 8 GPUs per node and 100Gbps network; and (ii) an A100 cluster, where each machine has 8 NVIDIA A100 GPUs (80GB memory each) and 100Gbps InfiniBand network. For both types of clusters, the GPUs within each node are connected with high-speed NVLink.
Software Dependencies | No | The paper mentions using 'mixed-precision training' and that 'All optimizers tested here use the same level of implementation (native PyTorch API)'. However, it does not specify version numbers for PyTorch, CUDA, or other software libraries, making it difficult to reproduce the exact software environment.
Experiment Setup | Yes | For SLAMB, we use the following hyper-parameter settings: β1 = 0.9, β2 = 0.999, β3 = 0.93, compression ratio k = 0.1, model synchronization interval H = 100. ... Detailed hyper-parameter settings are shown in Appendix G.1. For phase 1 with sequence length 128, we set the learning rate (LR) to 6e-3 and use a linear warm-up and polynomial decay (degree = 0.5) LR scheduler. The LR warm-up steps and total steps are 2000 and 7038, respectively. (The quoted schedule and compression ratio are illustrated in the sketches after the table.)
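Since Algorithm 1 itself is not reproduced in this report, the sketch below illustrates the generic sparse-communication building block that SLAMB relies on: top-k gradient compression with error feedback, using the compression ratio k = 0.1 quoted in the Experiment Setup row. The class name `TopKCompressor` and all implementation details are assumptions made for illustration; this is not a transcription of the authors' Algorithm 1.

```python
import torch

class TopKCompressor:
    """Generic top-k gradient sparsification with error feedback.

    Illustrative sketch of the sparse-communication idea SLAMB relies on;
    NOT a transcription of the paper's Algorithm 1.
    """

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio   # fraction of entries transmitted (k = 0.1 in the paper's setup)
        self.error = {}      # per-tensor residual kept locally (error-feedback memory)

    def compress(self, name: str, grad: torch.Tensor):
        # Fold in the residual that was not transmitted in the previous round.
        residual = self.error.get(name, torch.zeros_like(grad))
        corrected = (grad + residual).flatten()

        # Keep only the top-k entries by magnitude; only these are communicated.
        k = max(1, int(corrected.numel() * self.ratio))
        _, idx = torch.topk(corrected.abs(), k)
        values = corrected[idx]

        # Everything not selected stays in local memory for a later round.
        mask = torch.zeros_like(corrected, dtype=torch.bool)
        mask[idx] = True
        self.error[name] = torch.where(mask, torch.zeros_like(corrected), corrected).view_as(grad)

        return values, idx   # the sparse payload (values + indices) actually exchanged
```

In a distributed run, only `values` and `idx` would be exchanged between workers, which is where the reported reduction in transmitted data comes from; per the quoted setup, the full model is additionally synchronized every H = 100 steps.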
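The phase-1 learning-rate schedule quoted in the Experiment Setup row (linear warm-up to 6e-3 over 2000 steps, then polynomial decay of degree 0.5 over 7038 total steps) can be written out explicitly. The helper below is a sketch under those quoted values; the exact decay formula is an assumption and is not taken from the SLAMB repository.

```python
def phase1_lr(step: int,
              peak_lr: float = 6e-3,
              warmup_steps: int = 2000,
              total_steps: int = 7038,
              degree: float = 0.5) -> float:
    """Linear warm-up followed by polynomial decay, using the phase-1
    (sequence length 128) values quoted above. Illustrative only; the
    authors' scheduler implementation may differ in detail."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Decay from peak_lr towards 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - min(progress, 1.0)) ** degree

# Example: learning rate at a few points during phase 1.
for s in (0, 1000, 2000, 4500, 7038):
    print(s, round(phase1_lr(s), 6))
```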