Quantized Distributed Training of Large Models with Convergence Guarantees
Authors: Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x. |
| Researcher Affiliation | Academia | 1. Institute of Science and Technology Austria, 2. CNRS, 3. IRIF, Université Paris Cité, 4. Max Planck Institute for Informatics. |
| Pseudocode | Yes | Algorithm 1 Gradient-based Optimization of the Levels |
| Open Source Code | No | The paper mentions using and building upon existing frameworks (e.g., PyTorch FSDP, CGX, MosaicML examples) but does not provide an explicit statement of code release or a direct link to their specific QSDP implementation code. |
| Open Datasets | Yes | on the C4 dataset (Raffel et al., 2020). |
| Dataset Splits | No | The paper mentions 'Validation perplexity' and uses the C4 dataset, which has standard splits, but it does not explicitly state the dataset split percentages or sample counts for training, validation, and testing within the text. |
| Hardware Specification | Yes | We evaluate QSDP for training GPT-scale LLMs using multiple cloud-grade Amazon EC2 p3dn.24xlarge machines, with 8 V100 SXM2 GPUs each. Each GPU has 32GB memory. |
| Software Dependencies | Yes | We use the official NGC PyTorch 22.05-py3 Docker image with PyTorch 1.12, CUDA 11.6.2, NCCL 2.12, and the MosaicML Composer library (version 0.12), as well as a fork of the CGX communication library (Markov et al., 2022). |
| Experiment Setup | Yes | The global batch size for the 125M and 350M models was 256, and for the 1.3B model it was 512, resulting in 4 gradient accumulations at each iteration. For all models the AdamW optimizer was used; the optimizer parameters are presented in Table 4 of the paper (see the sketch after this table). |
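
For concreteness, the sketch below illustrates the gradient-accumulation pattern described in the experiment-setup row: 4 accumulation steps per AdamW update. It is a minimal, self-contained example, not the authors' code; the model, data, and the AdamW hyperparameters (`lr`, `betas`, `weight_decay`) are stand-ins, since the actual values are in the paper's Table 4 and the real runs train GPT-family models on C4 under PyTorch FSDP.

```python
import torch
from torch import nn
from torch.optim import AdamW

# Stand-ins for the paper's GPT-family model and C4 data loader; the real runs
# wrap the model in PyTorch FSDP across multiple 8x V100 nodes.
model = nn.Linear(512, 512)
data = [(torch.randn(64, 512), torch.randn(64, 512)) for _ in range(16)]

# Quoted setup: global batch size 256 (125M/350M models) or 512 (1.3B model),
# giving 4 gradient-accumulation steps per optimizer update.
ACCUMULATION_STEPS = 4

# Hypothetical AdamW hyperparameters; the actual values appear in the paper's
# Table 4, which is not reproduced in this summary.
optimizer = AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.1)
loss_fn = nn.MSELoss()

model.train()
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    loss = loss_fn(model(inputs), targets)
    # Scale the loss so the accumulated gradient averages over the full batch.
    (loss / ACCUMULATION_STEPS).backward()
    if (step + 1) % ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```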