Quantized Distributed Training of Large Models with Convergence Guarantees

Authors: Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x."
Researcher Affiliation | Academia | 1: Institute of Science and Technology Austria; 2: CNRS; 3: IRIF, Université Paris Cité; 4: Max Planck Institute for Informatics.
Pseudocode | Yes | Algorithm 1: Gradient-based Optimization of the Levels (a hedged illustrative sketch follows the table)
Open Source Code | No | The paper mentions using and building upon existing frameworks (e.g., PyTorch FSDP, CGX, MosaicML examples) but does not provide an explicit statement of code release or a direct link to their specific QSDP implementation code.
Open Datasets | Yes | "on the C4 dataset (Raffel et al., 2020)"
Dataset Splits | No | The paper mentions "Validation perplexity" and uses the C4 dataset, which has standard splits, but it does not explicitly state the split percentages or sample counts for training, validation, and testing (see the C4 loading sketch after the table).
Hardware Specification | Yes | "We evaluate QSDP for training GPT-scale LLMs using multiple cloud-grade Amazon EC2 p3dn.24xlarge machines, with 8 V100 SXM2 GPUs each. Each GPU has 32GB memory."
Software Dependencies | Yes | "We use the official NGC PyTorch 22.05-py3 Docker image with PyTorch 1.12, CUDA 11.6.2, NCCL 2.12, and the MosaicML Composer library (version 0.12), as well as a fork of the CGX communication library (Markov et al., 2022)."
Experiment Setup | Yes | "The global batch size for the 125M and 350M models was 256, and for the 1.3B model it was 512, resulting in 4 gradient accumulations at each iteration. For all models the AdamW optimizer was used; the optimizer parameters are presented in Table 4." (a minimal gradient-accumulation sketch follows the table)
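
Algorithm 1 ("Gradient-based Optimization of the Levels") is only named in the table above; its body is not reproduced in this summary. As a rough, hedged illustration of what gradient-based optimization of quantization levels can look like in PyTorch, the sketch below fits a small set of learnable levels to a synthetic tensor by minimizing the round-to-nearest quantization error. The objective, the synthetic data, and every hyperparameter (`num_levels`, learning rate, step count) are assumptions for illustration, not the paper's algorithm.

```python
import torch

def nearest_level_quantize(x, levels):
    # Map each entry of x to its nearest quantization level.
    # The argmin index is treated as a constant, so gradients flow
    # only into the selected levels (an assumption of this sketch).
    idx = torch.argmin((x.unsqueeze(-1) - levels.detach()).abs(), dim=-1)
    return levels[idx]

# Synthetic stand-in for a weight or gradient tensor (assumption).
x = torch.randn(10_000)

# Learnable levels, initialized uniformly over the data range.
num_levels = 8  # assumed level count, not taken from the paper
levels = torch.nn.Parameter(
    torch.linspace(x.min().item(), x.max().item(), num_levels)
)

opt = torch.optim.SGD([levels], lr=0.05)  # assumed optimizer and step size
for _ in range(200):
    opt.zero_grad()
    loss = (nearest_level_quantize(x, levels) - x).pow(2).mean()
    loss.backward()
    opt.step()
```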
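
The question raised in the "Dataset Splits" row can at least be grounded in the dataset itself: the public C4 release ships predefined train and validation splits. The sketch below loads them through the Hugging Face `datasets` hub copy (`allenai/c4`, `en` config); the specific mirror, config, and streaming mode are assumptions, since the paper only cites C4 (Raffel et al., 2020) without naming a source.

```python
from datasets import load_dataset

# Standard C4 English splits from the Hugging Face hub (assumed source).
# C4/en ships only "train" and "validation"; the paper's validation
# perplexity is presumably computed on the latter.
train = load_dataset("allenai/c4", "en", split="train", streaming=True)
validation = load_dataset("allenai/c4", "en", split="validation", streaming=True)

print(next(iter(validation))["text"][:200])
```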
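
To make the batch-size arithmetic in the "Experiment Setup" row concrete, the sketch below pairs AdamW with gradient accumulation: each forward pass covers `num_gpus * micro_batch` samples, so a global batch of 256 needs `256 // (num_gpus * micro_batch)` accumulation steps. The model, data, GPU count, per-GPU micro-batch, and AdamW hyperparameters are placeholders; the paper's actual optimizer settings are in its Table 4, which is not reproduced here.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder standing in for a GPT-family model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # assumed values

global_batch = 256   # 256 for the 125M/350M models, 512 for 1.3B (from the paper)
num_gpus = 16        # placeholder cluster size
micro_batch = 4      # placeholder per-GPU micro-batch
accum_steps = global_batch // (num_gpus * micro_batch)  # 4 with these placeholders

def next_micro_batch():
    # Placeholder for one per-GPU micro-batch of inputs and targets.
    return torch.randn(micro_batch, 512), torch.randn(micro_batch, 512)

for _ in range(10):  # a few illustrative optimizer steps
    optimizer.zero_grad()
    for _ in range(accum_steps):
        inputs, targets = next_micro_batch()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        (loss / accum_steps).backward()  # average gradients across micro-batches
    optimizer.step()
```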