Quantized Distributed Training of Large Models with Convergence Guarantees

Authors: Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x."
Researcher Affiliation | Academia | 1: Institute of Science and Technology Austria; 2: CNRS; 3: IRIF, Université Paris Cité; 4: Max Planck Institute for Informatics.
Pseudocode | Yes | Algorithm 1: Gradient-based Optimization of the Levels (a hedged illustrative sketch follows the table)
Open Source Code | No | The paper mentions using and building upon existing frameworks (e.g., PyTorch FSDP, CGX, MosaicML examples) but does not provide an explicit statement of code release or a direct link to their specific QSDP implementation code.
Open Datasets | Yes | "on the C4 dataset (Raffel et al., 2020)"
Dataset Splits | No | The paper mentions "Validation perplexity" and uses the C4 dataset, which has standard splits, but it does not explicitly state the split percentages or sample counts for training, validation, and testing (see the C4 loading sketch after the table).
Hardware Specification | Yes | "We evaluate QSDP for training GPT-scale LLMs using multiple cloud-grade Amazon EC2 p3dn.24xlarge machines, with 8 V100 SXM2 GPUs each. Each GPU has 32GB memory."
Software Dependencies | Yes | "We use the official NGC PyTorch 22.05-py3 Docker image with PyTorch 1.12, CUDA 11.6.2, NCCL 2.12, and the MosaicML Composer library (version 0.12), as well as a fork of the CGX communication library (Markov et al., 2022)."
Experiment Setup | Yes | "The global batch size for the 125M and 350M models was 256, and for the 1.3B model it was 512, resulting in 4 gradient accumulations at each iteration. For all models the AdamW optimizer was used; the optimizer parameters are presented in Table 4." (a minimal gradient-accumulation sketch follows the table)
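
Algorithm 1 ("Gradient-based Optimization of the Levels") is only named in the table above; its body is not reproduced in this summary. As a rough, hedged illustration of what gradient-based optimization of quantization levels can look like in PyTorch, the sketch below fits a small set of learnable levels to a synthetic tensor by minimizing the round-to-nearest quantization error. The objective, the synthetic data, and every hyperparameter (`num_levels`, learning rate, step count) are assumptions for illustration, not the paper's algorithm.

```python
import torch

def nearest_level_quantize(x, levels):
    # Map each entry of x to its nearest quantization level.
    # The argmin index is treated as a constant, so gradients flow
    # only into the selected levels (an assumption of this sketch).
    idx = torch.argmin((x.unsqueeze(-1) - levels.detach()).abs(), dim=-1)
    return levels[idx]

# Synthetic stand-in for a weight or gradient tensor (assumption).
x = torch.randn(10_000)

# Learnable levels, initialized uniformly over the data range.
num_levels = 8  # assumed level count, not taken from the paper
levels = torch.nn.Parameter(
    torch.linspace(x.min().item(), x.max().item(), num_levels)
)

opt = torch.optim.SGD([levels], lr=0.05)  # assumed optimizer and step size
for _ in range(200):
    opt.zero_grad()
    loss = (nearest_level_quantize(x, levels) - x).pow(2).mean()
    loss.backward()
    opt.step()
```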
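
The question raised in the "Dataset Splits" row can at least be grounded in the dataset itself: the public C4 release ships predefined train and validation splits. The sketch below loads them through the Hugging Face `datasets` hub copy (`allenai/c4`, `en` config); the specific mirror, config, and streaming mode are assumptions, since the paper only cites C4 (Raffel et al., 2020) without naming a source.

```python
from datasets import load_dataset

# Standard C4 English splits from the Hugging Face hub (assumed source).
# C4/en ships only "train" and "validation"; the paper's validation
# perplexity is presumably computed on the latter.
train = load_dataset("allenai/c4", "en", split="train", streaming=True)
validation = load_dataset("allenai/c4", "en", split="validation", streaming=True)

print(next(iter(validation))["text"][:200])
```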
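
To make the batch-size arithmetic in the "Experiment Setup" row concrete, the sketch below pairs AdamW with gradient accumulation: each forward pass covers `num_gpus * micro_batch` samples, so a global batch of 256 needs `256 // (num_gpus * micro_batch)` accumulation steps. The model, data, GPU count, per-GPU micro-batch, and AdamW hyperparameters are placeholders; the paper's actual optimizer settings are in its Table 4, which is not reproduced here.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder standing in for a GPT-family model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # assumed values

global_batch = 256   # 256 for the 125M/350M models, 512 for 1.3B (from the paper)
num_gpus = 16        # placeholder cluster size
micro_batch = 4      # placeholder per-GPU micro-batch
accum_steps = global_batch // (num_gpus * micro_batch)  # 4 with these placeholders

def next_micro_batch():
    # Placeholder for one per-GPU micro-batch of inputs and targets.
    return torch.randn(micro_batch, 512), torch.randn(micro_batch, 512)

for _ in range(10):  # a few illustrative optimizer steps
    optimizer.zero_grad()
    for _ in range(accum_steps):
        inputs, targets = next_micro_batch()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        (loss / accum_steps).backward()  # average gradients across micro-batches
    optimizer.step()
```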