Communication Efficient Distributed Training with Distributed Lion

Authors: Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, Qiang Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. In this section, we perform a thorough evaluation of the Distributed Lion algorithm, employing both the averaging and majority vote aggregation methods.
Researcher Affiliation | Collaboration | Bo Liu, The University of Texas at Austin, bliu@cs.utexas.edu; Lemeng Wu, Meta AI, lmwu@google.com; Lizhang Chen, The University of Texas at Austin, lzchen@utexas.edu; Kaizhao Liang, The University of Texas at Austin, kaizhaol@utexas.edu; Jiaxu Zhu, Meta AI, jiaxuzhu@meta.com; Chen Liang, Google, crazydonkey@google.com; Raghuraman Krishnamoorthi, Meta AI, raghuraman@meta.com; Qiang Liu, The University of Texas at Austin, lqiang@cs.utexas.edu
Pseudocode | Yes | Algorithm 1: Distributed Lion Training
Open Source Code | No | We will release code upon acceptance.
Open Datasets | Yes | We conduct experiments on the CIFAR-10 dataset... For the ImageNet-1K benchmark... For pretraining language models on the OpenWebText dataset... (Citations: [31] Olga Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.; [17] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.)
Dataset Splits | Yes | For language modeling experiments, we report the validation perplexity.
Hardware Specification | No | The paper states how many workers are needed for each experiment, but the GPU resources can be arbitrary as long as the model fits in memory.
Software Dependencies | No | The paper does not provide specific software dependency names with their version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup | Yes | For GradDrop, DGC, and TernGrad, we choose a compression rate of 0.04 (note that 1/32 = 0.03125) to match the bandwidth of D-Lion (MaVo). We conduct experiments on the CIFAR-10 dataset using a vision transformer (ViT) with 6 layers, 8 heads, and a hidden dimension of 512... We list the optimal hyperparameters selected for each method from Figure 2 in Table 4. The learning rates are selected from {0.00005, 0.001, 0.005, 0.01} and the weight decays are selected from {0.0005, 0.001, 0.005}. For each experiment, we use a cosine learning rate scheduler and run for 200 epochs...
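The majority-vote aggregation evaluated above can be sketched in a few lines. This is a minimal illustration rather than the authors' released implementation: the function names, hyperparameter defaults, and the NumPy-based aggregation are assumptions, based only on the standard Lion update rule and the paper's description of workers communicating sign (±1) updates.

```python
import numpy as np

def lion_worker_step(grad, momentum, beta1=0.9, beta2=0.99):
    """One Lion step on a worker. The worker communicates only the
    sign of its update (1 bit per parameter instead of a 32-bit float)."""
    update = np.sign(beta1 * momentum + (1 - beta1) * grad)
    momentum = beta2 * momentum + (1 - beta2) * grad
    return update, momentum

def majority_vote(worker_updates):
    """Server-side aggregation: elementwise majority vote over the
    workers' binary updates, i.e. the sign of their sum (ties give 0)."""
    return np.sign(np.sum(worker_updates, axis=0))
```

With three workers sending updates [1, -1, 1], [1, 1, -1], and [1, -1, -1], the vote yields [1, -1, -1]; the server then applies this sign vector scaled by the learning rate.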
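The parenthetical "1/32 = 0.03125" in the setup follows from sign-based communication sending 1 bit per parameter instead of a 32-bit float, so a 0.04 rate for the baselines slightly exceeds D-Lion's bandwidth. A quick check of that arithmetic (constant names here are illustrative, not from the paper):

```python
FLOAT_BITS = 32   # bits per parameter with full-precision communication
SIGN_BITS = 1     # bits per parameter when only the sign is sent

# Relative bandwidth of sign-based updates vs. full precision.
sign_rate = SIGN_BITS / FLOAT_BITS   # 0.03125
baseline_rate = 0.04                 # rate chosen for GradDrop, DGC, TernGrad
```

Because 0.04 > 0.03125, the compressed-gradient baselines are given at least as much communication budget as D-Lion (MaVo).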